Submitted by:
| # | Name | Id | Email |
|---|---|---|---|
| Student 1 | Alon Moses | 308177815 | alon.moses@post.idc.ac.il |
| Student 2 | Guy Attia | 305743437 | guy.attia@post.idc.ac.il |
In this assignment we'll create a from-scratch implementation of two fundamental deep learning concepts: the backpropagation algorithm and stochastic gradient descent-based optimizers. Following that, we will focus on convolutional networks with residual blocks. We'll use PyTorch to create our own network architectures and train them using GPUs, and we'll conduct architecture experiments to determine the effects of different architectural decisions on the performance of deep networks.
1. When you create the networks, please use PyTorch blocks and not your own layer and optimizer implementations. That way you can work on different parts that do not depend on each other, and if you have a bug it will be much faster to find. If there are dependencies between the notebooks that we've missed, please notify us via Piazza so we can provide a workaround.
2. Due to the previous homework, some of the tasks here became a bonus; try to get to them last. If a section depends on a bonus section, please use the PyTorch library implementation for the bonus, so you can continue without it.
You can of course use any editor or IDE to work on these files.

In this part we will learn about backpropagation and automatic differentiation. We'll implement both of these concepts from scratch and compare our implementation to PyTorch's built-in implementation (autograd).
import torch
import unittest
%load_ext autoreload
%autoreload 2
test = unittest.TestCase()
The backpropagation algorithm is at the core of training deep models. To state the problem we'll tackle in this notebook, imagine we have an L-layer MLP model, defined as $$ \hat{\vec{y}^i} = \vec{y}_L^i = \varphi_L \left( \mat{W}_L \varphi_{L-1} \left( \cdots \varphi_1 \left( \mat{W}_1 \vec{x}^i + \vec{b}_1 \right) \cdots \right) \right) $$
a pointwise loss function $\ell(\vec{y}, \hat{\vec{y}})$ and an empirical loss over our entire data set, $$ L(\vec{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \ell(\vec{y}^i, \hat{\vec{y}^i}) + R(\vec{\theta}) $$
where $\vec{\theta}$ is a vector containing all network parameters, e.g. $\vec{\theta} = \left[ \mat{W}_{1,:}, \vec{b}_1, \dots, \mat{W}_{L,:}, \vec{b}_L \right]$.
In order to train our model we would like to calculate the derivative (or gradient, in the multivariate case) of the loss with respect to each and every one of the parameters, i.e. $\pderiv{L}{\mat{W}_j}$ and $\pderiv{L}{\vec{b}_j}$ for all $j$. Since the gradient "points" in the direction of functional increase, the negative gradient is often used as a descent direction for descent-based optimization algorithms. In other words, iteratively updating each parameter proportionally to its negative gradient can lead to convergence to a local minimum of the loss function.
Calculus tells us that as long as we know the derivatives of all the functions "along the way" ($\varphi_i(\cdot),\ \ell(\cdot,\cdot),\ R(\cdot)$) we can use the chain rule to calculate the derivative of the loss with respect to any one of the parameter vectors. Note that if the loss $L(\vec{\theta})$ is scalar (which is usually the case), the gradient of a parameter will have the same shape as the parameter itself (matrix/vector/tensor of same dimensions).
For deep models that are a composition of many functions, calculating the gradient of each parameter by hand and implementing hard-coded gradient derivations quickly becomes infeasible. Additionally, such code makes models hard to change, since any change potentially requires re-derivation and re-implementation of the entire gradient function.
The backpropagation algorithm, which we saw in the lecture, provides us with an effective method of applying the chain rule recursively so that we can implement gradient calculations of arbitrarily deep or complex models.
We'll now implement backpropagation using a modular approach, which will allow us to chain many component layers together and get automatic gradient calculation of the output with respect to the input or any intermediate parameter.
To do this, we'll define a Layer class. Here's the API of this class:
import hw2.layers as layers
help(layers.Layer)
Help on class Layer in module hw2.layers:
class Layer(abc.ABC)
| A Layer is some computation element in a network architecture which
| supports automatic differentiation using forward and backward functions.
|
| Method resolution order:
| Layer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __call__(self, *args, **kwargs)
| Call self as a function.
|
| __init__(self)
| Initialize self. See help(type(self)) for accurate signature.
|
| __repr__(self)
| Return repr(self).
|
| backward(self, dout)
| Computes the backward pass of the layer, i.e. the gradient
| calculation of the final network output with respect to each of the
| parameters of the forward function.
| :param dout: The gradient of the network with respect to the
| output of this layer.
| :return: A tuple with the same number of elements as the parameters of
| the forward function. Each element will be the gradient of the
| network output with respect to that parameter.
|
| forward(self, *args, **kwargs)
| Computes the forward pass of the layer.
| :param args: The computation arguments (implementation specific).
| :return: The result of the computation.
|
| params(self)
| :return: Layer's trainable parameters and their gradients as a list
| of tuples, each tuple containing a tensor and it's corresponding
| gradient tensor.
|
| train(self, training_mode=True)
| Changes the mode of this layer between training and evaluation (test)
| mode. Some layers have different behaviour depending on mode.
| :param training_mode: True: set the model in training mode. False: set
| evaluation mode.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'backward', 'forward', 'params'})
In other words, a Layer can be anything: a layer, an activation function, a loss function or generally any computation that we know how to derive a gradient for.
Each block must define a forward() function and a backward() function.
- The forward() function performs the actual calculation/operation of the block and returns an output.
- The backward() function computes the gradient of the input and parameters as a function of the gradient of the output, according to the chain rule.

Here's a diagram illustrating the above explanation:

Note that the diagram doesn't show that if the function is parametrized, i.e. $f(\vec{x},\vec{y})=f(\vec{x},\vec{y};\vec{w})$, there are also gradients to calculate for the parameters $\vec{w}$.
The forward pass is straightforward: just do the computation. To understand the backward pass, imagine that there's some "downstream" loss function $L(\vec{\theta})$ and magically somehow we are told the gradient of that loss with respect to the output $\vec{z}$ of our block, i.e. $\pderiv{L}{\vec{z}}$.
Now, since we know how to calculate the derivative of $f(\vec{x},\vec{y};\vec{w})$, it means we know how to calculate $\pderiv{\vec{z}}{\vec{x}}$, $\pderiv{\vec{z}}{\vec{y}}$ and $\pderiv{\vec{z}}{\vec{w}}$ . Thanks to the chain rule, this is all we need to calculate the gradients of the loss w.r.t. the input and parameters:
$$ \begin{align} \pderiv{L}{\vec{x}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{x}}\\ \pderiv{L}{\vec{y}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{y}}\\ \pderiv{L}{\vec{w}} &= \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}} \end{align} $$

PyTorch has the nn.Module base class, which may seem similar to our Layer, since it also represents a computation element in a network.
However, PyTorch's nn.Modules don't compute the gradient directly; they only define the forward calculations.
Instead, PyTorch has a more low-level API for defining a function and explicitly implementing its forward() and backward(). See autograd.Function.
When an operation is performed on a tensor, it creates a Function instance which performs the operation and stores any necessary information for calculating the gradient later on. Additionally, each Function points to the Function objects representing the operations performed earlier on the tensor. Thus, a graph (or DAG) of operations is created (this is not 100% exact, as the graph is actually composed of a different type of class which wraps the backward method, but it's accurate enough for our purposes).
A Tensor instance which was created by performing operations on one or more tensors with requires_grad=True, has a grad_fn property which is a Function instance representing the last operation performed to produce this tensor.
This exposes the graph of Function instances, each with its own backward() function. Therefore, in PyTorch the backward() function is called on the tensors, not the modules.
Our Layers are therefore a combination of the ideas in Module and Function and we'll implement them together,
just to make things simpler.
Our goal here is to create a "poor man's autograd": We'll use PyTorch tensors,
but we'll calculate and store the gradients in our Layers (or return them).
The gradients we'll calculate are of the entire block, not individual operations on tensors.
To test our implementation, we'll use PyTorch's autograd.
Note that of course this method of tracking gradients is much more limited than what PyTorch offers. However it allows us to implement the backpropagation algorithm very simply and really see how it works.
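To make this concrete, here is a minimal toy sketch (hypothetical, not the actual hw2.layers code) of a block that caches its input in forward() and applies the chain rule in backward():

```python
import torch

class Square:
    """Toy block: z = x**2 element-wise. Illustrative sketch, not hw2.layers."""
    def forward(self, x):
        self.x = x           # cache the input for the backward pass
        return x ** 2

    def backward(self, dout):
        # Chain rule: dL/dx = dL/dz * dz/dx = dout * 2x (element-wise)
        return dout * 2 * self.x

layer = Square()
x = torch.tensor([1.0, -2.0, 3.0])
z = layer.forward(x)                       # tensor([1., 4., 9.])
dx = layer.backward(torch.ones_like(z))    # gradient of sum(z) w.r.t. x
```

Here `dout = 1` plays the role of the "magical" downstream gradient: given it, the block produces the input gradient without knowing anything about the rest of the network.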
Let's set up some testing instrumentation:
from hw2.grad_compare import compare_layer_to_torch
def test_block_grad(block: layers.Layer, x, y=None, delta=1e-3):
diffs = compare_layer_to_torch(block, x, y)
# Assert diff values
for diff in diffs:
test.assertLess(diff, delta)
# Show the compare function
compare_layer_to_torch??
Notes:
- Read the compare_layer_to_torch() function. It will help you understand what PyTorch is doing.
- The delta above should not be needed: a correct implementation will give you a diff of exactly zero.

We'll now implement some Layers that will enable us to later build an MLP model of arbitrary depth, complete with automatic differentiation.
For each block, you'll first implement the forward() function.
Then, you will calculate the derivative of the block by hand with respect to each of its
input tensors and each of its parameter tensors (if any).
Using your manually-calculated derivation, you can then implement the backward() function.
Notice that we have intermediate Jacobians that are potentially high dimensional tensors. For example in the expression $\pderiv{L}{\vec{w}} = \pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$, the term $\pderiv{\vec{z}}{\vec{w}}$ is a 4D Jacobian if both $\vec{z}$ and $\vec{w}$ are 2D matrices.
In order to implement the backpropagation algorithm efficiently, we need to implement every backward function without explicitly constructing this Jacobian. Instead, we're interested in directly calculating the vector-Jacobian product (VJP) $\pderiv{L}{\vec{z}}\cdot \pderiv{\vec{z}}{\vec{w}}$. In order to do this, you should try to figure out the gradient of the loss with respect to one element, e.g. $\pderiv{L}{\vec{w}_{1,1}}$ and extrapolate from there how to directly obtain the VJP.
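To see why the VJP is enough, here is a small sketch (all names are illustrative) that materializes the full 4D Jacobian for a tiny linear map via torch.autograd.functional.jacobian, then checks that contracting it with the output gradient matches the direct matrix-product VJP:

```python
import torch

# Tiny linear map z = x @ w.T, small enough that the full 4D Jacobian fits in memory.
N, d_in, d_out = 3, 4, 2
x = torch.randn(N, d_in)
w = torch.randn(d_out, d_in)
dout = torch.randn(N, d_out)     # stand-in for dL/dz from downstream

# Full Jacobian dz/dw has shape (N, d_out, d_out, d_in) -- 4D, as noted above.
jac = torch.autograd.functional.jacobian(lambda w: x @ w.t(), w)

full = torch.einsum('no,noij->ij', dout, jac)  # contract dL/dz with the 4D Jacobian
vjp = dout.t() @ x                             # direct VJP: no Jacobian materialized
print(torch.allclose(full, vjp, atol=1e-5))    # True
```

Working element-wise as suggested ($\pderiv{z_{n,o}}{w_{i,j}} = x_{n,j}\delta_{o,i}$) is exactly how one arrives at the closed-form matrix product on the last line.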
ReLU, or rectified linear unit, is a very common activation function in deep learning architectures. In its most standard form, as we'll implement here, it has no parameters.
We'll first implement the "leaky" version, defined as
$$ \mathrm{lrelu}(\vec{x}) = \max(\alpha\vec{x},\vec{x}), \ 0\leq\alpha<1. $$
This is similar to the ReLU activation we've seen in class, except that it has a small non-zero slope when its input is negative. Note that it's not strictly differentiable; however, it has sub-gradients, defined separately for positive-valued and for negative-valued inputs.
TODO: Complete the implementation of the LeakyReLU class in the hw2/layers.py module.
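As a hedged reference for the shapes involved (one possible approach, not necessarily the expected hw2/layers.py solution), the leaky ReLU forward and its sub-gradient backward can be sketched as:

```python
import torch

def lrelu_forward(x, alpha=0.1):
    # max(alpha*x, x) element-wise, valid for 0 <= alpha < 1
    return torch.max(x, alpha * x)

def lrelu_backward(x, dout, alpha=0.1):
    # Sub-gradient: slope 1 where x > 0, slope alpha where x <= 0,
    # applied element-wise to the incoming gradient.
    grad = torch.where(x > 0, torch.ones_like(x), torch.full_like(x, alpha))
    return dout * grad

x = torch.tensor([-2.0, 0.5, 3.0])
z = lrelu_forward(x)                        # tensor([-0.2000, 0.5000, 3.0000])
dx = lrelu_backward(x, torch.ones_like(x))  # tensor([0.1000, 1.0000, 1.0000])
```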
N = 100
in_features = 200
num_classes = 10
eps = 1e-6
# Test LeakyReLU
alpha = 0.1
lrelu = layers.LeakyReLU(alpha=alpha)
x_test = torch.randn(N, in_features)
# Test forward pass
z = lrelu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.nn.LeakyReLU(alpha)(x_test), atol=eps))
# Test backward pass
test_block_grad(lrelu, x_test)
Comparing gradients... input diff=0.000
/Users/guyattia/PycharmProjects/MSC-DL-Course/hw2/hw2/layers.py:104: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). x_positive = torch.tensor(dout * positive_grad) /Users/guyattia/PycharmProjects/MSC-DL-Course/hw2/hw2/layers.py:105: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). x_negative = torch.tensor(dout * negative_grad)
Now using the LeakyReLU, we can trivially define a regular ReLU block as a special case.
TODO: Complete the implementation of the ReLU class in the hw2/layers.py module.
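A quick sanity check of this special-case relationship, using PyTorch's built-in functions: ReLU coincides with leaky ReLU at $\alpha=0$, since $\max(0\cdot x, x) = \max(0, x)$.

```python
import torch
import torch.nn.functional as F

# ReLU is just a leaky ReLU with zero negative slope:
x = torch.randn(1000)
print(torch.allclose(torch.relu(x), F.leaky_relu(x, negative_slope=0.0)))  # True
```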
# Test ReLU
relu = layers.ReLU()
x_test = torch.randn(N, in_features)
# Test forward pass
z = relu(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.relu(x_test), atol=eps))
# Test backward pass
test_block_grad(relu, x_test)
Comparing gradients... input diff=0.000
The sigmoid function $\sigma(x)$ is also sometimes used as an activation function. We have also seen it previously in the context of logistic regression.
The sigmoid function is defined as
$$ \sigma(\vec{x}) = \frac{1}{1+\exp(-\vec{x})}. $$
# Test Sigmoid
sigmoid = layers.Sigmoid()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = sigmoid(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.sigmoid(x_test), atol=eps))
# Test backward pass
test_block_grad(sigmoid, x_test)
Comparing gradients... input diff=0.000
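A useful fact for the backward pass is that $\sigma'(x) = \sigma(x)(1-\sigma(x))$, so caching the forward output suffices. A quick autograd check:

```python
import torch

# The sigmoid derivative has a convenient closed form in terms of the output itself,
# so backward() only needs the cached forward result, not the input:
x = torch.randn(4, requires_grad=True)
s = torch.sigmoid(x)
s.sum().backward()                       # autograd's dL/dx for L = sum(sigmoid(x))
manual = s.detach() * (1 - s.detach())   # closed-form derivative s * (1 - s)
print(torch.allclose(x.grad, manual))    # True
```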
The hyperbolic tangent function $\tanh(x)$ is a common activation function used when the output should be in the range [-1, 1].
The tanh function is defined as
$$ \tanh(\vec{x}) = \frac{\exp(\vec{x})-\exp(-\vec{x})}{\exp(\vec{x})+\exp(-\vec{x})}. $$
# Test TanH
tanh = layers.TanH()
x_test = torch.randn(N, in_features, in_features) # 3D input should work
# Test forward pass
z = tanh(x_test)
test.assertSequenceEqual(z.shape, x_test.shape)
test.assertTrue(torch.allclose(z, torch.tanh(x_test), atol=eps))
# Test backward pass
test_block_grad(tanh, x_test)
Comparing gradients... input diff=0.000
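Similarly, $\tanh'(x) = 1 - \tanh^2(x)$, so here too the cached forward output is all the backward pass needs:

```python
import torch

# tanh's derivative is also a closed form in the forward output:
x = torch.randn(4, requires_grad=True)
t = torch.tanh(x)
t.sum().backward()                     # autograd's dL/dx for L = sum(tanh(x))
manual = 1 - t.detach() ** 2           # closed-form derivative 1 - tanh(x)^2
print(torch.allclose(x.grad, manual))  # True
```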
First, we'll implement an affine transform layer, also known as a fully connected layer.
Given an input $\mat{X}$ the layer computes,
$$ \mat{Z} = \mat{X} \mattr{W} + \vec{b} ,~ \mat{X}\in\set{R}^{N\times D_{\mathrm{in}}},~ \mat{W}\in\set{R}^{D_{\mathrm{out}}\times D_{\mathrm{in}}},~ \vec{b}\in\set{R}^{D_{\mathrm{out}}}. $$
Notes:
TODO: Complete the implementation of the Linear class in the hw2/layers.py module.
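Following the element-wise derivation suggested earlier, the affine layer's three VJPs reduce to matrix products. A sketch (variable names are illustrative) verified against autograd:

```python
import torch

N, d_in, d_out = 8, 5, 3
x = torch.randn(N, d_in, requires_grad=True)
w = torch.randn(d_out, d_in, requires_grad=True)
b = torch.randn(d_out, requires_grad=True)

z = x @ w.t() + b
dout = torch.randn(N, d_out)    # stand-in for dL/dz from downstream
z.backward(dout)                # autograd computes the reference gradients

# Direct VJPs -- no intermediate Jacobians are ever materialized:
dx = dout @ w.detach()          # (N, d_in)
dw = dout.t() @ x.detach()      # (d_out, d_in)
db = dout.sum(dim=0)            # (d_out,) -- the bias is broadcast over the batch
print(torch.allclose(x.grad, dx), torch.allclose(w.grad, dw), torch.allclose(b.grad, db))
```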
# Test Linear
out_features = 1000
fc = layers.Linear(in_features, out_features)
x_test = torch.randn(N, in_features)
# Test forward pass
z = fc(x_test)
test.assertSequenceEqual(z.shape, [N, out_features])
torch_fc = torch.nn.Linear(in_features, out_features, bias=True)
torch_fc.weight = torch.nn.Parameter(fc.w)
torch_fc.bias = torch.nn.Parameter(fc.b)
test.assertTrue(torch.allclose(torch_fc(x_test), z, atol=eps))
# Test backward pass
test_block_grad(fc, x_test)
# Test second backward pass
x_test = torch.randn(N, in_features)
z = fc(x_test)
z = fc(x_test)
test_block_grad(fc, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000
As you know by now, cross-entropy is a common loss function for classification tasks. In class, we defined it as
$$\ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) = - {\vectr{y}} \log(\hat{\vec{y}})$$
where $\hat{\vec{y}} = \mathrm{softmax}(\vec{x})$ is a probability vector (the output of softmax on the class scores $\vec{x}$) and the vector $\vec{y}$ is a one-hot encoded class label.
However, it's tricky to compute the gradient of softmax, so instead we'll define a version of cross-entropy that produces the exact same output but works directly on the class scores $\vec{x}$.
We can write, $$\begin{align} \ell_{\mathrm{CE}}(\vec{y},\hat{\vec{y}}) &= - {\vectr{y}} \log(\hat{\vec{y}}) = - {\vectr{y}} \log\left(\mathrm{softmax}(\vec{x})\right) \\ &= - {\vectr{y}} \log\left(\frac{e^{\vec{x}}}{\sum_k e^{x_k}}\right) \\ &= - \log\left(\frac{e^{x_y}}{\sum_k e^{x_k}}\right) \\ &= - \left(\log\left(e^{x_y}\right) - \log\left(\sum_k e^{x_k}\right)\right)\\ &= - x_y + \log\left(\sum_k e^{x_k}\right) \end{align}$$
Where the scalar $y$ is the correct class label, so $x_y$ is the correct class score.
Note that this version of cross entropy is also what's provided by PyTorch's nn module.
TODO: Complete the implementation of the CrossEntropyLoss class in the hw2/layers.py module.
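A convenient consequence of the derivation above is that the per-sample gradient w.r.t. the scores is $\mathrm{softmax}(\vec{x}) - \vec{y}$ (with $\vec{y}$ one-hot), divided by $N$ under mean reduction. A quick check against autograd:

```python
import torch

N, C = 6, 4
scores = torch.randn(N, C, requires_grad=True)
labels = torch.randint(0, C, (N,))

loss = torch.nn.functional.cross_entropy(scores, labels)  # mean reduction by default
loss.backward()

# Closed-form gradient: (softmax(x) - onehot(y)) / N
manual = torch.softmax(scores.detach(), dim=1)
manual[torch.arange(N), labels] -= 1.0
manual /= N
print(torch.allclose(scores.grad, manual, atol=1e-6))  # True
```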
# Test CrossEntropy
cross_entropy = layers.CrossEntropyLoss()
scores = torch.randn(N, num_classes)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
# Test forward pass
loss = cross_entropy(scores, labels)
expected_loss = torch.nn.functional.cross_entropy(scores, labels)
test.assertLess(torch.abs(expected_loss-loss).item(), 1e-5)
print('loss=', loss.item())
# Test backward pass
test_block_grad(cross_entropy, scores, y=labels)
loss= 2.728362798690796 Comparing gradients... input diff=0.000
Now that we have some working Layers, we can build an MLP model of arbitrary depth and compute end-to-end gradients.
First, let's copy an idea from PyTorch and implement our own version of the nn.Sequential Module.
This is a Layer which contains other Layers and calls them in sequence. We'll use this to build our MLP model.
TODO: Complete the implementation of the Sequential class in the hw2/layers.py module.
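The chaining logic can be sketched as follows (a hypothetical skeleton with a toy layer, not the expected hw2 solution):

```python
class SequentialSketch:
    """Illustrative container: forward runs in order, backward in reverse order."""
    def __init__(self, *layers):
        self.layers = layers

    def forward(self, x):
        for layer in self.layers:            # feed each layer's output into the next
            x = layer.forward(x)
        return x

    def backward(self, dout):
        for layer in reversed(self.layers):  # the chain rule runs back-to-front
            dout = layer.backward(dout)
        return dout

class Scale:
    """Toy layer z = c*x, just to demonstrate the chaining."""
    def __init__(self, c): self.c = c
    def forward(self, x): return self.c * x
    def backward(self, dout): return dout * self.c

seq = SequentialSketch(Scale(2.0), Scale(3.0))
print(seq.forward(1.0), seq.backward(1.0))  # 6.0 6.0
```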
# Test Sequential
# Let's create a long sequence of layers and see
# whether we can compute end-to-end gradients of the whole thing.
seq = layers.Sequential(
layers.Linear(in_features, 100),
layers.Linear(100, 200),
layers.Linear(200, 100),
layers.ReLU(),
layers.Linear(100, 500),
layers.LeakyReLU(alpha=0.01),
layers.Linear(500, 200),
layers.ReLU(),
layers.Linear(200, 500),
layers.LeakyReLU(alpha=0.1),
layers.Linear(500, 1),
layers.Sigmoid(),
)
x_test = torch.randn(N, in_features)
# Test forward pass
z = seq(x_test)
test.assertSequenceEqual(z.shape, [N, 1])
# Test backward pass
test_block_grad(seq, x_test)
Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 param#09 diff=0.000 param#10 diff=0.000 param#11 diff=0.000 param#12 diff=0.000 param#13 diff=0.000 param#14 diff=0.000
Now, equipped with a Sequential, all we have to do is create an MLP architecture.
We'll define our MLP with the following hyperparameters:
So the architecture will be:
FC($D$, $h_1$) $\rightarrow$ ReLU $\rightarrow$ FC($h_1$, $h_2$) $\rightarrow$ ReLU $\rightarrow$ $\cdots$ $\rightarrow$ FC($h_{L-1}$, $h_L$) $\rightarrow$ ReLU $\rightarrow$ FC($h_{L}$, $C$)
We'll also create a sequence of the above MLP and a cross-entropy loss, since it's the gradient of the loss that we need in order to train a model.
TODO: Complete the implementation of the MLP class in the hw2/layers.py module. Ignore the dropout parameter for now.
# Create an MLP model
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100])
print(mlp)
MLP, Sequential [0] Linear(self.in_features=200, self.out_features=100) [1] ReLU [2] Linear(self.in_features=100, self.out_features=50) [3] ReLU [4] Linear(self.in_features=50, self.out_features=100) [5] ReLU [6] Linear(self.in_features=100, self.out_features=10)
# Test MLP architecture
N = 100
in_features = 10
num_classes = 10
for activation in ('relu', 'sigmoid'):
mlp = layers.MLP(in_features, num_classes, hidden_features=[100, 50, 100], activation=activation)
test.assertEqual(len(mlp.sequence), 7)
num_linear = 0
for b1, b2 in zip(mlp.sequence, mlp.sequence[1:]):
if (str(b2).lower() == activation):
test.assertTrue(str(b1).startswith('Linear'))
num_linear += 1
test.assertTrue(str(mlp.sequence[-1]).startswith('Linear'))
test.assertEqual(num_linear, 3)
# Test MLP gradients
# Test forward pass
x_test = torch.randn(N, in_features)
labels = torch.randint(low=0, high=num_classes, size=(N,), dtype=torch.long)
z = mlp(x_test)
test.assertSequenceEqual(z.shape, [N, num_classes])
# Create a sequence of MLPs and CE loss
seq_mlp = layers.Sequential(mlp, layers.CrossEntropyLoss())
loss = seq_mlp(x_test, y=labels)
test.assertEqual(loss.dim(), 0)
print(f'MLP loss={loss}, activation={activation}')
# Test backward pass
test_block_grad(seq_mlp, x_test, y=labels)
MLP loss=2.309244394302368, activation=relu Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 MLP loss=2.3934404850006104, activation=sigmoid Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
If the above tests passed then congratulations - you've now implemented an arbitrarily deep model and loss function with end-to-end automatic differentiation!
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs3600.answers import display_answer
import hw2.answers
Suppose we have a linear (i.e. fully-connected) layer, defined with in_features=1024 and out_features=2048. We apply this layer to an input tensor $\mat{X}$ containing a batch of N=128 samples.
What would then be the shape of the Jacobian tensor of the output of the layer w.r.t. the input $\mat{X}$?
Assuming we're using single-precision floating point (32 bits) to represent our tensors, how many gigabytes of RAM or GPU memory will be required to store the above Jacobian?
display_answer(hw2.answers.part1_q1)
In this part we will learn how to implement optimization algorithms for deep networks. Additionally, we'll learn how to write training loops and implement a modular model trainer. We'll use our optimizers and training code to test a few configurations for classifying images with an MLP model.
import os
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
In the context of deep learning, an optimization algorithm is some method of iteratively updating model parameters so that the loss converges toward some local minimum (which we hope will be good enough).
Gradient descent-based methods are by far the most popular algorithms for optimization of neural network parameters. However the high-dimensional loss-surfaces we encounter in deep learning applications are highly non-convex. They may be riddled with local minima, saddle points, large plateaus and a host of very challenging "terrain" for gradient-based optimization. This gave rise to many different methods of performing the parameter updates based on the loss gradients, aiming to tackle these optimization challenges.
The most basic gradient-based update rule can be written as,
$$ \vec{\theta} \leftarrow \vec{\theta} - \eta \nabla_{\vec{\theta}} L(\vec{\theta}; \mathcal{D}) $$
where $\mathcal{D} = \left\{ (\vec{x}^i, \vec{y}^i) \right\}_{i=1}^{M}$ is our training dataset or part of it. Specifically, if we have in total $N$ training samples, then
The intuition behind gradient descent is simple: since the gradient of a multivariate function points in the direction of steepest ascent ("uphill"), we move in the opposite direction. A small step size $\eta$, known as the learning rate, is required, since the gradient only serves as a first-order linear approximation of the function's behaviour at $\vec{\theta}$ (recall e.g. the Taylor expansion). In truth, however, our loss surface generally has nontrivial curvature caused by a high-order nonlinear dependency on $\vec{\theta}$, so taking a large step in the direction of the negative gradient may actually increase the function value.

The idea behind the stochastic versions is that by constantly changing the samples we compute the loss with, we get a dynamic error surface, i.e. one that is different for each set of training samples. This is thought to generally improve the optimization, since it may help the optimizer escape flat regions or sharp local minima: these features may disappear in the loss surface of subsequent batches. The image below illustrates this. The different lines are different one-dimensional losses for different training-set samples.

Deep learning frameworks generally provide implementations of various gradient-based optimization algorithms.
Here we'll implement our own optimization module from scratch, this time keeping a similar API to the PyTorch optim package.
We define a base Optimizer class. An optimizer holds a set of parameter tensors (these are the trainable parameters of some model) and maintains internal state. It may be used as follows:
- The zero_grad() function is invoked to clear the parameter gradients computed by previous iterations.
- The step() function is invoked in order to update the value of each parameter based on its gradient.

The exact method of update is implementation-specific for each optimizer and may depend on its internal state. In addition, adding the regularization penalty to the gradient is handled by the optimizer, since it only depends on the parameter values (and not the data).
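For comparison, PyTorch's optim package follows the same zero_grad()/step() protocol that our API mirrors:

```python
import torch

# A standard PyTorch training iteration using the same optimizer protocol:
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.05)
x, y = torch.randn(8, 4), torch.randn(8, 2)

losses = []
for _ in range(50):
    opt.zero_grad()                                   # clear previous gradients
    loss = torch.nn.functional.mse_loss(model(x), y)  # forward pass + loss
    loss.backward()                                   # backprop: compute gradients
    opt.step()                                        # apply the update rule
    losses.append(loss.item())

print(losses[0] > losses[-1])  # True: the loss decreases over the iterations
```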
Here's the API of our Optimizer:
import hw2.optimizers as optimizers
help(optimizers.Optimizer)
Help on class Optimizer in module hw2.optimizers:
class Optimizer(abc.ABC)
| Optimizer(params)
|
| Base class for optimizers.
|
| Method resolution order:
| Optimizer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, params)
| :param params: A sequence of model parameters to optimize. Can be a
| list of (param,grad) tuples as returned by the Layers, or a list of
| pytorch tensors in which case the grad will be taken from them.
|
| step(self)
| Updates all the registered parameter values based on their gradients.
|
| zero_grad(self)
| Sets the gradient of the optimized parameters to zero (in place).
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| params
| :return: A sequence of parameter tuples, each tuple containing
| (param_data, param_grad). The data should be updated in-place
| according to the grad.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'step'})
Let's start by implementing the simplest gradient-based optimizer. The update rule will be exactly as stated above, but we'll also add an L2-regularization term to the gradient. Remember that in the loss function, the L2 regularization term is expressed by
$$R(\vec{\theta}) = \frac{1}{2}\lambda||\vec{\theta}||^2_2.$$
TODO: Complete the implementation of the VanillaSGD class in the hw2/optimizers.py module.
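Since $\nabla R(\vec{\theta}) = \lambda\vec{\theta}$, a single vanilla SGD step just adds reg * p to each gradient before the update. A hedged sketch of one possible step (illustrative, not necessarily the expected hw2/optimizers.py solution):

```python
import torch

def vanilla_sgd_step(p, dp, learn_rate, reg):
    # Update in place, as the Optimizer API requires; the L2 penalty's
    # gradient reg*p is added to the loss gradient dp.
    p -= learn_rate * (dp + reg * p)

p = torch.tensor([1.0, -2.0])
dp = torch.tensor([0.5, 0.5])
vanilla_sgd_step(p, dp, learn_rate=0.1, reg=0.1)
print(p)  # tensor([ 0.9400, -2.0300])
```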
# Test VanillaSGD
torch.manual_seed(42)
p = torch.randn(500, 10)
dp = torch.randn(*p.shape)*2
params = [(p, dp)]
vsgd = optimizers.VanillaSGD(params, learn_rate=0.5, reg=0.1)
vsgd.step()
expected_p = torch.load('tests/assets/expected_vsgd.pt')
diff = torch.norm(p-expected_p).item()
print(f'diff={diff}')
test.assertLess(diff, 1e-3)
diff=0.0
Now that we can build a model and loss function, compute their gradients and we have an optimizer, we can finally do some training!
In the spirit of more modular software design, we'll implement a class that will aid us in automating the repetitive training loop code that we usually write over and over again. This will be useful for both training our Layer-based models and also later for training PyTorch nn.Modules.
Here's our Trainer API:
import hw2.training as training
help(training.Trainer)
Help on class Trainer in module hw2.training:
class Trainer(abc.ABC)
| Trainer(model, loss_fn, optimizer, device=None)
|
| A class abstracting the various tasks of training models.
|
| Provides methods at multiple levels of granularity:
| - Multiple epochs (fit)
| - Single epoch (train_epoch/test_epoch)
| - Single batch (train_batch/test_batch)
|
| Method resolution order:
| Trainer
| abc.ABC
| builtins.object
|
| Methods defined here:
|
| __init__(self, model, loss_fn, optimizer, device=None)
| Initialize the trainer.
| :param model: Instance of the model to train.
| :param loss_fn: The loss function to evaluate with.
| :param optimizer: The optimizer to train with.
| :param device: torch.device to run training on (CPU or GPU).
|
| fit(self, dl_train: torch.utils.data.dataloader.DataLoader, dl_test: torch.utils.data.dataloader.DataLoader, num_epochs, checkpoints: str = None, early_stopping: int = None, print_every=1, **kw) -> cs3600.train_results.FitResult
| Trains the model for multiple epochs with a given training set,
| and calculates validation loss over a given validation set.
| :param dl_train: Dataloader for the training set.
| :param dl_test: Dataloader for the test set.
| :param num_epochs: Number of epochs to train for.
| :param checkpoints: Whether to save model to file every time the
| test set accuracy improves. Should be a string containing a
| filename without extension.
| :param early_stopping: Whether to stop training early if there is no
| test loss improvement for this number of epochs.
| :param print_every: Print progress every this number of epochs.
| :return: A FitResult object containing train and test losses per epoch.
|
| test_batch(self, batch) -> cs3600.train_results.BatchResult
| Runs a single batch forward through the model and calculates loss.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset.
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| test_epoch(self, dl_test: torch.utils.data.dataloader.DataLoader, **kw) -> cs3600.train_results.EpochResult
| Evaluate model once over a test set (single epoch).
| :param dl_test: DataLoader for the test set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| train_batch(self, batch) -> cs3600.train_results.BatchResult
| Runs a single batch forward through the model, calculates loss,
| performs back-propagation and uses the optimizer to update weights.
| :param batch: A single batch of data from a data loader (might
| be a tuple of data and labels or anything else depending on
| the underlying dataset.
| :return: A BatchResult containing the value of the loss function and
| the number of correctly classified samples in the batch.
|
| train_epoch(self, dl_train: torch.utils.data.dataloader.DataLoader, **kw) -> cs3600.train_results.EpochResult
| Train once over a training set (single epoch).
| :param dl_train: DataLoader for the training set.
| :param kw: Keyword args supported by _foreach_batch.
| :return: An EpochResult for the epoch.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset({'test_batch', 'train_batch'})
The Trainer class splits the task of training (and evaluating) models into three conceptual levels:
- The fit method, which returns a FitResult containing losses and accuracies for all epochs.
- The train_epoch and test_epoch methods, which return an EpochResult containing losses per batch and the single accuracy result of the epoch.
- The train_batch and test_batch methods, which return a BatchResult containing a single loss and the number of correctly classified samples in the batch.

The Trainer class implements the first two levels. Inheriting classes are expected to implement the single-batch methods, since these are model- and/or task-specific.
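The three-level split can be sketched as a minimal skeleton. This is a hypothetical simplification for illustration only; the real Trainer in hw2/training.py aggregates results into FitResult/EpochResult/BatchResult objects and adds checkpointing, early stopping and progress printing.

```python
from abc import ABC, abstractmethod

class MiniTrainer(ABC):
    """Hypothetical sketch of the Trainer's three conceptual levels."""

    def fit(self, dl_train, dl_test, num_epochs):
        # Level 1: multi-epoch loop; collects one (train, test) pair per epoch.
        return [(self.train_epoch(dl_train), self.test_epoch(dl_test))
                for _ in range(num_epochs)]

    def train_epoch(self, dl_train):
        # Level 2: a single pass over the training set, batch by batch.
        return [self.train_batch(batch) for batch in dl_train]

    def test_epoch(self, dl_test):
        return [self.test_batch(batch) for batch in dl_test]

    @abstractmethod
    def train_batch(self, batch):
        # Level 3: forward pass, loss, backprop, optimizer step (subclass-specific).
        ...

    @abstractmethod
    def test_batch(self, batch):
        # Level 3: forward pass and loss only.
        ...

class SumTrainer(MiniTrainer):
    # Toy subclass: the "loss" of a batch is just its sum.
    def train_batch(self, batch):
        return sum(batch)
    def test_batch(self, batch):
        return sum(batch)

res = SumTrainer().fit(dl_train=[[1, 2], [3]], dl_test=[[4]], num_epochs=2)
```

Only the two batch-level methods are abstract, which is exactly the contract LayerTrainer has to fulfill.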
The first thing we should do in order to verify our model, gradient calculations and optimizer implementation is to try to overfit a large model (many parameters) to a small dataset (few images). This will show us that things are working properly.
Let's begin by loading the CIFAR-10 dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
Files already downloaded and verified Files already downloaded and verified Train: 50000 samples Test: 10000 samples
Now, let's implement just a small part of our training logic since that's what we need right now.
TODO:
- Implement the train_batch() method in the LayerTrainer class within the hw2/training.py module.
- Implement the part2_overfit_hp() function in the hw2/answers.py module. Tweak the hyperparameter values until your model overfits a small number of samples in the code block below. You should get 100% accuracy within a few epochs.

The following code block will use your custom Layer-based MLP implementation, custom vanilla SGD optimizer and custom trainer to overfit the data. The classification accuracy should reach 100% within a few epochs.
import hw2.layers as layers
import hw2.answers as answers
from torch.utils.data import DataLoader
# Overfit to a very small dataset of 20 samples
batch_size = 10
max_batches = 2
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Get hyperparameters
hp = answers.part2_overfit_hp()
torch.manual_seed(seed)
# Build a model and loss using our custom MLP and CE implementations
model = layers.MLP(3*32*32, num_classes=10, hidden_features=[128]*3, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
# Use our custom optimizer
optimizer = optimizers.VanillaSGD(model.params(), learn_rate=hp['lr'], reg=hp['reg'])
# Run training over small dataset multiple times
trainer = training.LayerTrainer(model, loss_fn, optimizer)
best_acc = 0
for i in range(20):
res = trainer.train_epoch(dl_train, max_batches=max_batches)
best_acc = res.accuracy if res.accuracy > best_acc else best_acc
test.assertGreaterEqual(best_acc, 98)
train_batch (Avg. Loss 3.221, Accuracy 0.0): 100%|██████████| 2/2 [00:00<00:00, 118.06it/s]
/Users/guyattia/PycharmProjects/MSC-DL-Course/hw2/hw2/layers.py:104: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). x_positive = torch.tensor(dout * positive_grad) /Users/guyattia/PycharmProjects/MSC-DL-Course/hw2/hw2/layers.py:105: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor). x_negative = torch.tensor(dout * negative_grad)
train_batch (Avg. Loss 2.530, Accuracy 25.0): 100%|██████████| 2/2 [00:00<00:00, 153.92it/s] train_batch (Avg. Loss 2.124, Accuracy 40.0): 100%|██████████| 2/2 [00:00<00:00, 165.05it/s] train_batch (Avg. Loss 1.658, Accuracy 55.0): 100%|██████████| 2/2 [00:00<00:00, 175.53it/s] train_batch (Avg. Loss 1.242, Accuracy 60.0): 100%|██████████| 2/2 [00:00<00:00, 199.46it/s] train_batch (Avg. Loss 1.892, Accuracy 50.0): 100%|██████████| 2/2 [00:00<00:00, 195.27it/s] train_batch (Avg. Loss 2.125, Accuracy 30.0): 100%|██████████| 2/2 [00:00<00:00, 187.26it/s] train_batch (Avg. Loss 2.022, Accuracy 40.0): 100%|██████████| 2/2 [00:00<00:00, 197.61it/s] train_batch (Avg. Loss 1.404, Accuracy 60.0): 100%|██████████| 2/2 [00:00<00:00, 188.50it/s] train_batch (Avg. Loss 1.191, Accuracy 65.0): 100%|██████████| 2/2 [00:00<00:00, 187.97it/s] train_batch (Avg. Loss 1.372, Accuracy 55.0): 100%|██████████| 2/2 [00:00<00:00, 187.04it/s] train_batch (Avg. Loss 1.396, Accuracy 55.0): 100%|██████████| 2/2 [00:00<00:00, 198.06it/s] train_batch (Avg. Loss 0.724, Accuracy 90.0): 100%|██████████| 2/2 [00:00<00:00, 192.74it/s] train_batch (Avg. Loss 0.708, Accuracy 85.0): 100%|██████████| 2/2 [00:00<00:00, 191.92it/s] train_batch (Avg. Loss 1.766, Accuracy 50.0): 100%|██████████| 2/2 [00:00<00:00, 195.62it/s] train_batch (Avg. Loss 0.628, Accuracy 90.0): 100%|██████████| 2/2 [00:00<00:00, 193.17it/s] train_batch (Avg. Loss 0.479, Accuracy 85.0): 100%|██████████| 2/2 [00:00<00:00, 193.82it/s] train_batch (Avg. Loss 0.228, Accuracy 95.0): 100%|██████████| 2/2 [00:00<00:00, 189.50it/s] train_batch (Avg. Loss 0.124, Accuracy 100.0): 100%|██████████| 2/2 [00:00<00:00, 191.33it/s] train_batch (Avg. Loss 0.076, Accuracy 100.0): 100%|██████████| 2/2 [00:00<00:00, 193.83it/s]
Now that we know training works, let's try to fit a model to a bit more data for a few epochs, to see how well we're doing. First, we need a function to plot the FitResult object.
from cs3600.plot import plot_fit
plot_fit?
TODO:
- Implement the test_batch() method in the LayerTrainer class within the hw2/training.py module.
- Implement the fit() method of the Trainer class within the hw2/training.py module.
- Tweak the hyperparameters in the part2_optim_hp() function in the hw2/answers.py module.

# Define a larger part of the CIFAR-10 dataset (still not the whole thing)
batch_size = 50
max_batches = 100
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size//2, shuffle=False)
# Define a function to train a model with our Trainer and various optimizers
def train_with_optimizer(opt_name, opt_class, fig):
torch.manual_seed(seed)
# Get hyperparameters
hp = answers.part2_optim_hp()
hidden_features = [128] * 5
num_epochs = 10
# Create model, loss and optimizer instances
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'])
loss_fn = layers.CrossEntropyLoss()
optimizer = opt_class(model.params(), learn_rate=hp[f'lr_{opt_name}'], reg=hp['reg'])
# Train with the Trainer
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches)
fig, axes = plot_fit(fit_res, fig=fig, legend=opt_name)
return fig
fig_optim = None
fig_optim = train_with_optimizer('vanilla', optimizers.VanillaSGD, fig_optim)
--- EPOCH 1/10 --- train_batch (Avg. Loss 2.161, Accuracy 19.8): 100%|██████████| 100/100 [00:00<00:00, 117.58it/s] test_batch (Avg. Loss 2.044, Accuracy 24.3): 100%|██████████| 100/100 [00:00<00:00, 232.85it/s] --- EPOCH 2/10 --- train_batch (Avg. Loss 1.989, Accuracy 27.1): 100%|██████████| 100/100 [00:00<00:00, 115.12it/s] test_batch (Avg. Loss 1.955, Accuracy 29.7): 100%|██████████| 100/100 [00:00<00:00, 232.45it/s] --- EPOCH 3/10 --- train_batch (Avg. Loss 1.906, Accuracy 30.1): 100%|██████████| 100/100 [00:00<00:00, 117.83it/s] test_batch (Avg. Loss 1.891, Accuracy 32.2): 100%|██████████| 100/100 [00:00<00:00, 233.06it/s] --- EPOCH 4/10 --- train_batch (Avg. Loss 1.850, Accuracy 32.9): 100%|██████████| 100/100 [00:00<00:00, 116.00it/s] test_batch (Avg. Loss 1.857, Accuracy 34.5): 100%|██████████| 100/100 [00:00<00:00, 233.15it/s] --- EPOCH 5/10 --- train_batch (Avg. Loss 1.810, Accuracy 34.4): 100%|██████████| 100/100 [00:00<00:00, 116.47it/s] test_batch (Avg. Loss 1.826, Accuracy 35.1): 100%|██████████| 100/100 [00:00<00:00, 238.43it/s] --- EPOCH 6/10 --- train_batch (Avg. Loss 1.778, Accuracy 35.6): 100%|██████████| 100/100 [00:00<00:00, 115.35it/s] test_batch (Avg. Loss 1.809, Accuracy 35.9): 100%|██████████| 100/100 [00:00<00:00, 234.09it/s] --- EPOCH 7/10 --- train_batch (Avg. Loss 1.750, Accuracy 36.5): 100%|██████████| 100/100 [00:00<00:00, 112.51it/s] test_batch (Avg. Loss 1.802, Accuracy 36.2): 100%|██████████| 100/100 [00:00<00:00, 236.33it/s] --- EPOCH 8/10 --- train_batch (Avg. Loss 1.730, Accuracy 37.4): 100%|██████████| 100/100 [00:00<00:00, 116.63it/s] test_batch (Avg. Loss 1.787, Accuracy 36.6): 100%|██████████| 100/100 [00:00<00:00, 203.35it/s] --- EPOCH 9/10 --- train_batch (Avg. Loss 1.710, Accuracy 38.5): 100%|██████████| 100/100 [00:00<00:00, 115.55it/s] test_batch (Avg. Loss 1.783, Accuracy 36.2): 100%|██████████| 100/100 [00:00<00:00, 235.28it/s] --- EPOCH 10/10 --- train_batch (Avg. 
Loss 1.694, Accuracy 39.0): 100%|██████████| 100/100 [00:00<00:00, 117.56it/s] test_batch (Avg. Loss 1.774, Accuracy 36.8): 100%|██████████| 100/100 [00:00<00:00, 235.32it/s]
The simple vanilla SGD update is rarely used in practice since it's very slow to converge relative to other optimization algorithms.
One reason is that naïvely updating in the direction of the current gradient causes the updates to fluctuate wildly in areas where the loss surface is much steeper along some dimensions than along others. Another reason is that using the same learning rate for all parameters is not ideal, since not all parameters are created equal. For example, parameters associated with rare features should be updated with a larger step than ones associated with commonly-occurring features, because they receive fewer updates through the gradients.
Therefore more advanced optimizers take into account the previous gradients of a parameter and/or try to use a per-parameter specific learning rate instead of a common one.
Let's now implement a simple and common optimizer: SGD with Momentum. This optimizer takes previous gradients of a parameter into account when updating its value, instead of just the current one. In practice it usually converges faster than vanilla SGD.
The SGD with Momentum update rule can be stated as follows: $$\begin{align} \vec{v}_{t+1} &= \mu \vec{v}_t - \eta \delta \vec{\theta}_t \\ \vec{\theta}_{t+1} &= \vec{\theta}_t + \vec{v}_{t+1} \end{align}$$
Where $\eta$ is the learning rate, $\vec{\theta}$ is a model parameter, $\delta \vec{\theta}_t=\pderiv{L}{\vec{\theta}}(\vec{\theta}_t)$ is the gradient of the loss w.r.t. the parameter and $0\leq\mu<1$ is a hyperparameter known as momentum.
Expanding the update rule recursively shows how the parameter update in fact depends on all previous gradient values for that parameter, with older gradients exponentially decayed by a factor of $\mu$ at each timestep.
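Concretely, assuming $\vec{v}_0=\vec{0}$, unrolling the recursion gives $$ \vec{v}_{t+1} = -\eta \sum_{k=0}^{t} \mu^{t-k}\, \delta\vec{\theta}_k $$ so a gradient that is $j$ steps old enters the update with weight $\mu^{j}$.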
Since we're incorporating previous gradients (update directions), a noisy value of the current gradient has less effect, so the general direction of previous updates is somewhat maintained. The following figure illustrates this.

TODO:
- Implement the MomentumSGD class in the hw2/optimizers.py module.
- Tweak the hyperparameters in the part2_optim_hp() function in the hw2/answers.py module.

fig_optim = train_with_optimizer('momentum', optimizers.MomentumSGD, fig_optim)
fig_optim
--- EPOCH 1/10 --- train_batch (2.153): 21%|██ | 21/100 [00:00<00:00, 102.23it/s]
train_batch (Avg. Loss 2.152, Accuracy 19.4): 100%|██████████| 100/100 [00:01<00:00, 93.87it/s] test_batch (Avg. Loss 2.008, Accuracy 25.4): 100%|██████████| 100/100 [00:00<00:00, 178.95it/s] --- EPOCH 2/10 --- train_batch (Avg. Loss 1.943, Accuracy 28.2): 100%|██████████| 100/100 [00:00<00:00, 103.23it/s] test_batch (Avg. Loss 1.966, Accuracy 28.8): 100%|██████████| 100/100 [00:00<00:00, 234.98it/s] --- EPOCH 3/10 --- train_batch (Avg. Loss 1.858, Accuracy 33.2): 100%|██████████| 100/100 [00:00<00:00, 109.93it/s] test_batch (Avg. Loss 1.879, Accuracy 33.0): 100%|██████████| 100/100 [00:00<00:00, 233.64it/s] --- EPOCH 4/10 --- train_batch (Avg. Loss 1.797, Accuracy 35.1): 100%|██████████| 100/100 [00:00<00:00, 109.46it/s] test_batch (Avg. Loss 1.877, Accuracy 32.7): 100%|██████████| 100/100 [00:00<00:00, 231.50it/s] --- EPOCH 5/10 --- train_batch (Avg. Loss 1.761, Accuracy 36.5): 100%|██████████| 100/100 [00:00<00:00, 110.62it/s] test_batch (Avg. Loss 1.864, Accuracy 33.4): 100%|██████████| 100/100 [00:00<00:00, 232.78it/s] --- EPOCH 6/10 --- train_batch (Avg. Loss 1.736, Accuracy 37.2): 100%|██████████| 100/100 [00:00<00:00, 109.36it/s] test_batch (Avg. Loss 1.818, Accuracy 35.3): 100%|██████████| 100/100 [00:00<00:00, 231.99it/s] --- EPOCH 7/10 --- train_batch (Avg. Loss 1.705, Accuracy 38.6): 100%|██████████| 100/100 [00:00<00:00, 108.60it/s] test_batch (Avg. Loss 1.805, Accuracy 35.7): 100%|██████████| 100/100 [00:00<00:00, 235.76it/s] --- EPOCH 8/10 --- train_batch (Avg. Loss 1.680, Accuracy 39.6): 100%|██████████| 100/100 [00:00<00:00, 110.48it/s] test_batch (Avg. Loss 1.805, Accuracy 35.4): 100%|██████████| 100/100 [00:00<00:00, 231.29it/s] --- EPOCH 9/10 --- train_batch (Avg. Loss 1.662, Accuracy 40.7): 100%|██████████| 100/100 [00:00<00:00, 109.76it/s] test_batch (Avg. Loss 1.798, Accuracy 35.7): 100%|██████████| 100/100 [00:00<00:00, 234.24it/s] --- EPOCH 10/10 --- train_batch (Avg. 
Loss 1.649, Accuracy 41.1): 100%|██████████| 100/100 [00:00<00:00, 110.27it/s] test_batch (Avg. Loss 1.791, Accuracy 36.2): 100%|██████████| 100/100 [00:00<00:00, 230.93it/s]
RMSProp is another optimizer that accounts for previous gradients, but this time it uses them to adapt the learning rate per parameter.
RMSProp maintains a decaying moving average of previous squared gradients, $$ \vec{r}_{t+1} = \gamma\vec{r}_{t} + (1-\gamma)\delta\vec{\theta}_t^2 $$ where $0<\gamma<1$ is a decay constant usually set close to $1$, and $\delta\vec{\theta}_t^2$ denotes element-wise squaring.
The update rule for each parameter is then, $$ \vec{\theta}_{t+1} = \vec{\theta}_t - \left( \frac{\eta}{\sqrt{\vec{r}_{t+1}+\varepsilon}} \right) \delta\vec{\theta}_t $$
where $\varepsilon$ is a small constant to prevent numerical instability. The idea here is to decrease the learning rate for parameters with high gradient values and vice-versa. The decaying moving average prevents accumulating all the past gradients which would cause the effective learning rate to become zero.
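The per-parameter scaling can be sketched on raw tensors as below. This is an illustrative sketch only, not the hw2 RMSProp class; `r` is the decaying moving average of squared gradients from the formula above.

```python
import torch

def rmsprop_step(param, grad, r, lr=0.01, gamma=0.99, eps=1e-8):
    """One RMSProp step: update the running average of squared gradients,
    then scale the step by the inverse root of that average (sketch only)."""
    r = gamma * r + (1 - gamma) * grad ** 2
    return param - (lr / torch.sqrt(r + eps)) * grad, r

# Two parameters whose gradients differ by 10x take near-identical steps,
# because each step is normalized by that parameter's own gradient magnitude.
p, r = torch.zeros(2), torch.zeros(2)
g = torch.tensor([0.1, 1.0])
p, r = rmsprop_step(p, g, r)
```

This illustrates the adaptive effect: the effective learning rate shrinks for parameters with consistently large gradients and grows for those with small ones.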
TODO:
- Implement the RMSProp class in the hw2/optimizers.py module.
- Tweak the hyperparameters in the part2_optim_hp() function in the hw2/answers.py module.

fig_optim = train_with_optimizer('rmsprop', optimizers.RMSProp, fig_optim)
fig_optim
--- EPOCH 1/10 --- train_batch (2.632): 18%|█▊ | 18/100 [00:00<00:00, 91.73it/s]
train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 98.96it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 230.99it/s] --- EPOCH 2/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:00<00:00, 100.09it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 232.91it/s] --- EPOCH 3/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 99.26it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 233.00it/s] --- EPOCH 4/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:00<00:00, 100.29it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 234.06it/s] --- EPOCH 5/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 98.95it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 232.49it/s] --- EPOCH 6/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 96.04it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 230.46it/s] --- EPOCH 7/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 97.15it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 233.10it/s] --- EPOCH 8/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 99.27it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 218.57it/s] --- EPOCH 9/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 99.72it/s] test_batch (Avg. Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 231.04it/s] --- EPOCH 10/10 --- train_batch (Avg. Loss 2.664, Accuracy 9.0): 100%|██████████| 100/100 [00:01<00:00, 99.26it/s] test_batch (Avg. 
Loss 2.673, Accuracy 9.3): 100%|██████████| 100/100 [00:00<00:00, 239.20it/s]
Note that you should get better train/test accuracy with Momentum and RMSProp than with vanilla SGD.
Dropout is a useful technique to improve generalization of deep models.
The idea is simple: during the forward pass, drop (i.e. set to zero) the activation of each neuron with probability $p$. For example, if $p=0.4$ we drop the activations of 40% of the neurons on average.
There are a few important things to note about dropout:
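One key point (assuming the common "inverted dropout" convention; check the hw2 spec for the required one) is the train/test asymmetry: dropout is applied only in training mode, and the surviving activations are rescaled by $1/(1-p)$ so that the expected activation is unchanged and no rescaling is needed at test time. A minimal sketch:

```python
import torch

def dropout_forward(x, p=0.5, training=True):
    """Inverted-dropout sketch (illustrative only, not the hw2 Dropout layer).

    In training mode, zeros each activation with probability p and scales the
    survivors by 1/(1-p); in test mode it is the identity.
    """
    if not training or p == 0:
        return x
    mask = (torch.rand_like(x) >= p).float() / (1 - p)
    return x * mask

torch.manual_seed(0)
x = torch.ones(10000)
y = dropout_forward(x, p=0.4)                        # ~40% zeros, mean stays ~1
y_eval = dropout_forward(x, p=0.4, training=False)   # identity in test mode
```

The rescaling is what makes the train-time and test-time forward passes consistent in expectation.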
TODO:
- Implement the Dropout class in the hw2/layers.py module.
- Update the MLP's __init__() method in the hw2/layers.py module: if dropout>0, add a Dropout layer after each ReLU.

from hw2.grad_compare import compare_layer_to_torch
# Check architecture of MLP with dropout layers
mlp_dropout = layers.MLP(in_features, num_classes, [50]*3, dropout=0.6)
print(mlp_dropout)
test.assertEqual(len(mlp_dropout.sequence), 10)
for b1, b2 in zip(mlp_dropout.sequence, mlp_dropout.sequence[1:]):
if str(b1).lower() == 'relu':
test.assertTrue(str(b2).startswith('Dropout'))
test.assertTrue(str(mlp_dropout.sequence[-1]).startswith('Linear'))
MLP, Sequential [0] Linear(self.in_features=3072, self.out_features=50) [1] ReLU [2] Dropout(p=0.6) [3] Linear(self.in_features=50, self.out_features=50) [4] ReLU [5] Dropout(p=0.6) [6] Linear(self.in_features=50, self.out_features=50) [7] ReLU [8] Dropout(p=0.6) [9] Linear(self.in_features=50, self.out_features=10)
# Test end-to-end gradient in train and test modes.
print('Dropout, train mode')
mlp_dropout.train(True)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
print('Dropout, test mode')
mlp_dropout.train(False)
for diff in compare_layer_to_torch(mlp_dropout, torch.randn(500, in_features)):
test.assertLess(diff, 1e-3)
Dropout, train mode Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000 Dropout, test mode Comparing gradients... input diff=0.000 param#01 diff=0.000 param#02 diff=0.000 param#03 diff=0.000 param#04 diff=0.000 param#05 diff=0.000 param#06 diff=0.000 param#07 diff=0.000 param#08 diff=0.000
To see whether dropout really improves generalization, let's take a small training set (small enough to overfit) and a large test set and check whether we get less overfitting and perhaps improved test-set accuracy when using dropout.
# Define a small set from CIFAR-10, but take a larger test set since we want to test generalization
batch_size = 10
max_batches = 40
in_features = 3*32*32
num_classes = 10
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
dl_test = torch.utils.data.DataLoader(ds_test, batch_size*2, shuffle=False)
Provided:
Tweak the hyperparameters for this section in the part2_dropout_hp() function in the hw2/answers.py module. Try to set them so that the first model (with dropout=0) overfits. You can disable the other dropout options until you've tuned the hyperparameters. We can then see the effect of dropout on generalization.
# Get hyperparameters
hp = answers.part2_dropout_hp()
hidden_features = [400] * 1
num_epochs = 30
torch.manual_seed(seed)
fig=None
#for dropout in [0]: # Use this for tuning the hyperparams until you overfit
for dropout in [0, 0.4, 0.8]:
model = layers.MLP(in_features, num_classes, hidden_features, wstd=hp['wstd'], dropout=dropout)
loss_fn = layers.CrossEntropyLoss()
optimizer = optimizers.MomentumSGD(model.params(), learn_rate=hp['lr'], reg=0)
print('*** Training with dropout=', dropout)
trainer = training.LayerTrainer(model, loss_fn, optimizer)
fit_res_dropout = trainer.fit(dl_train, dl_test, num_epochs, max_batches=max_batches, print_every=6)
fig, axes = plot_fit(fit_res_dropout, fig=fig, legend=f'dropout={dropout}', log_loss=True)
*** Training with dropout= 0 --- EPOCH 1/30 --- train_batch (Avg. Loss 3.726, Accuracy 13.0): 100%|██████████| 40/40 [00:00<00:00, 120.29it/s] test_batch (Avg. Loss 2.717, Accuracy 14.4): 100%|██████████| 40/40 [00:00<00:00, 246.06it/s] train_batch (Avg. Loss 2.323, Accuracy 26.2): 100%|██████████| 40/40 [00:00<00:00, 124.43it/s] test_batch (Avg. Loss 2.527, Accuracy 21.0): 100%|██████████| 40/40 [00:00<00:00, 250.23it/s] train_batch (Avg. Loss 1.937, Accuracy 34.2): 100%|██████████| 40/40 [00:00<00:00, 125.88it/s] test_batch (Avg. Loss 2.523, Accuracy 22.8): 100%|██████████| 40/40 [00:00<00:00, 243.39it/s] train_batch (Avg. Loss 1.723, Accuracy 42.5): 100%|██████████| 40/40 [00:00<00:00, 124.52it/s] test_batch (Avg. Loss 2.556, Accuracy 22.5): 100%|██████████| 40/40 [00:00<00:00, 244.35it/s] train_batch (Avg. Loss 1.605, Accuracy 44.2): 100%|██████████| 40/40 [00:00<00:00, 124.26it/s] test_batch (Avg. Loss 2.541, Accuracy 22.4): 100%|██████████| 40/40 [00:00<00:00, 242.90it/s] train_batch (Avg. Loss 1.507, Accuracy 46.8): 100%|██████████| 40/40 [00:00<00:00, 124.31it/s] test_batch (Avg. Loss 2.508, Accuracy 23.5): 100%|██████████| 40/40 [00:00<00:00, 241.55it/s] --- EPOCH 7/30 --- train_batch (Avg. Loss 1.425, Accuracy 49.0): 100%|██████████| 40/40 [00:00<00:00, 125.62it/s] test_batch (Avg. Loss 2.543, Accuracy 22.5): 100%|██████████| 40/40 [00:00<00:00, 240.54it/s] train_batch (Avg. Loss 1.313, Accuracy 55.8): 100%|██████████| 40/40 [00:00<00:00, 125.90it/s] test_batch (Avg. Loss 2.556, Accuracy 21.4): 100%|██████████| 40/40 [00:00<00:00, 244.23it/s] train_batch (Avg. Loss 1.200, Accuracy 61.0): 100%|██████████| 40/40 [00:00<00:00, 125.22it/s] test_batch (Avg. Loss 2.544, Accuracy 20.8): 100%|██████████| 40/40 [00:00<00:00, 243.02it/s] train_batch (Avg. Loss 1.097, Accuracy 64.5): 100%|██████████| 40/40 [00:00<00:00, 125.32it/s] test_batch (Avg. Loss 2.540, Accuracy 21.9): 100%|██████████| 40/40 [00:00<00:00, 245.07it/s] train_batch (Avg. 
Loss 1.024, Accuracy 67.2): 100%|██████████| 40/40 [00:00<00:00, 125.84it/s] test_batch (Avg. Loss 2.567, Accuracy 21.6): 100%|██████████| 40/40 [00:00<00:00, 254.21it/s] train_batch (Avg. Loss 0.964, Accuracy 71.0): 100%|██████████| 40/40 [00:00<00:00, 122.01it/s] test_batch (Avg. Loss 2.560, Accuracy 21.4): 100%|██████████| 40/40 [00:00<00:00, 254.25it/s] --- EPOCH 13/30 --- train_batch (Avg. Loss 0.906, Accuracy 73.2): 100%|██████████| 40/40 [00:00<00:00, 122.48it/s] test_batch (Avg. Loss 2.535, Accuracy 22.0): 100%|██████████| 40/40 [00:00<00:00, 254.38it/s] train_batch (Avg. Loss 0.845, Accuracy 75.2): 100%|██████████| 40/40 [00:00<00:00, 122.51it/s] test_batch (Avg. Loss 2.516, Accuracy 22.8): 100%|██████████| 40/40 [00:00<00:00, 251.35it/s] train_batch (Avg. Loss 0.784, Accuracy 78.0): 100%|██████████| 40/40 [00:00<00:00, 122.36it/s] test_batch (Avg. Loss 2.514, Accuracy 22.9): 100%|██████████| 40/40 [00:00<00:00, 250.23it/s] train_batch (Avg. Loss 0.734, Accuracy 80.2): 100%|██████████| 40/40 [00:00<00:00, 121.42it/s] test_batch (Avg. Loss 2.531, Accuracy 23.6): 100%|██████████| 40/40 [00:00<00:00, 251.02it/s] train_batch (Avg. Loss 0.690, Accuracy 82.8): 100%|██████████| 40/40 [00:00<00:00, 122.44it/s] test_batch (Avg. Loss 2.550, Accuracy 23.6): 100%|██████████| 40/40 [00:00<00:00, 253.03it/s] train_batch (Avg. Loss 0.650, Accuracy 83.8): 100%|██████████| 40/40 [00:00<00:00, 122.35it/s] test_batch (Avg. Loss 2.570, Accuracy 23.8): 100%|██████████| 40/40 [00:00<00:00, 253.90it/s] --- EPOCH 19/30 --- train_batch (Avg. Loss 0.615, Accuracy 84.8): 100%|██████████| 40/40 [00:00<00:00, 121.27it/s] test_batch (Avg. Loss 2.579, Accuracy 23.2): 100%|██████████| 40/40 [00:00<00:00, 253.90it/s] train_batch (Avg. Loss 0.584, Accuracy 86.8): 100%|██████████| 40/40 [00:00<00:00, 121.00it/s] test_batch (Avg. Loss 2.593, Accuracy 22.6): 100%|██████████| 40/40 [00:00<00:00, 254.58it/s] train_batch (Avg. 
Loss 0.556, Accuracy 87.0): 100%|██████████| 40/40 [00:00<00:00, 121.97it/s] test_batch (Avg. Loss 2.606, Accuracy 23.0): 100%|██████████| 40/40 [00:00<00:00, 254.21it/s] train_batch (Avg. Loss 0.531, Accuracy 87.5): 100%|██████████| 40/40 [00:00<00:00, 120.40it/s] test_batch (Avg. Loss 2.625, Accuracy 23.0): 100%|██████████| 40/40 [00:00<00:00, 255.84it/s] train_batch (Avg. Loss 0.509, Accuracy 89.5): 100%|██████████| 40/40 [00:00<00:00, 121.68it/s] test_batch (Avg. Loss 2.649, Accuracy 22.6): 100%|██████████| 40/40 [00:00<00:00, 253.86it/s] train_batch (Avg. Loss 0.488, Accuracy 91.2): 100%|██████████| 40/40 [00:00<00:00, 121.37it/s] test_batch (Avg. Loss 2.672, Accuracy 22.5): 100%|██████████| 40/40 [00:00<00:00, 246.77it/s] --- EPOCH 25/30 --- train_batch (Avg. Loss 0.468, Accuracy 91.2): 100%|██████████| 40/40 [00:00<00:00, 118.66it/s] test_batch (Avg. Loss 2.707, Accuracy 23.1): 100%|██████████| 40/40 [00:00<00:00, 248.85it/s] train_batch (Avg. Loss 0.453, Accuracy 93.0): 100%|██████████| 40/40 [00:00<00:00, 118.98it/s] test_batch (Avg. Loss 2.742, Accuracy 22.8): 100%|██████████| 40/40 [00:00<00:00, 249.16it/s] train_batch (Avg. Loss 0.437, Accuracy 93.2): 100%|██████████| 40/40 [00:00<00:00, 118.09it/s] test_batch (Avg. Loss 2.785, Accuracy 22.1): 100%|██████████| 40/40 [00:00<00:00, 248.56it/s] train_batch (Avg. Loss 0.424, Accuracy 93.0): 100%|██████████| 40/40 [00:00<00:00, 119.56it/s] test_batch (Avg. Loss 2.823, Accuracy 22.1): 100%|██████████| 40/40 [00:00<00:00, 248.42it/s] train_batch (Avg. Loss 0.410, Accuracy 93.5): 100%|██████████| 40/40 [00:00<00:00, 120.22it/s] test_batch (Avg. Loss 2.866, Accuracy 22.8): 100%|██████████| 40/40 [00:00<00:00, 251.80it/s] --- EPOCH 30/30 --- train_batch (Avg. Loss 0.399, Accuracy 94.0): 100%|██████████| 40/40 [00:00<00:00, 120.68it/s] test_batch (Avg. 
Loss 2.901, Accuracy 22.5): 100%|██████████| 40/40 [00:00<00:00, 239.78it/s] *** Training with dropout= 0.4 --- EPOCH 1/30 --- train_batch (4.390): 62%|██████▎ | 25/40 [00:00<00:00, 128.16it/s]
[Progress-bar training log truncated: 30 epochs, 40 train/test batches each.
Train: Avg. Loss 3.777 → 1.149, Accuracy 11.2 → 62.0.
Test: Avg. Loss 3.611 → 2.408 (minimum ≈ 2.27 around epoch 6), Accuracy 15.2 → 29.1.]

*** Training with dropout=0.8

[Progress-bar training log truncated: 30 epochs, 40 train/test batches each. A UserWarning from hw2/layers.py recommends sourceTensor.clone().detach() over torch.tensor(sourceTensor).
Train: Avg. Loss 3.271 → 1.973, Accuracy 9.2 → 29.0.
Test: Avg. Loss 3.340 → 2.839, Accuracy 10.5 → 25.8.]
TODO: Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs3600.answers import display_answer
import hw2.answers
Regarding the graphs you got for the three dropout configurations:
1. Explain the graphs of no-dropout vs. dropout. Do they match what you expected to see?
2. Compare the low-dropout setting to the high-dropout setting and explain based on your graphs.
display_answer(hw2.answers.part2_q1)
1. We can clearly see that without dropout we get overfitting on the training set: from around epoch 5, only the training-set accuracy keeps increasing while the test-set accuracy is stuck. With dropout, the picture is much healthier: the test-set accuracy keeps increasing together with the training-set accuracy (with perhaps a slightly noticeable plateau in the test accuracy over the last epochs).
2. We can see that too large a dropout probability can negatively affect training. In our case, using dropout=0.8 gave lower accuracy on both the train and test sets.
When training a model with the cross-entropy loss function, is it possible for the test loss to increase for a few epochs while the test accuracy also increases?
If it's possible explain how, if it's not explain why not.
display_answer(hw2.answers.part2_q2)
The cross-entropy loss is computed from the softmax probabilities, meaning it is not determined only by whether the classifier was right or wrong (a binary outcome); it is also affected by how confident the prediction was in the correct class.
So there can be a situation where the accuracy increases because more and more instances are classified correctly, while at the same time the classifier's decisions become less confident, which increases the loss value.
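A tiny numerical illustration of this effect (a hedged sketch with made-up logits, not actual experiment data): between "epoch A" and "epoch B", accuracy goes up because both samples become correctly classified, yet the average cross-entropy loss also goes up because the correct predictions become far less confident.

```python
import torch
import torch.nn.functional as F

# Two samples; correct labels are [0, 1].
labels = torch.tensor([0, 1])

# "Epoch A": only one sample classified correctly, but very confidently.
logits_a = torch.tensor([[8.0, 0.0],    # correct (label 0), very confident
                         [0.2, 0.0]])   # wrong   (label 1)

# "Epoch B": both samples classified correctly, but only barely.
logits_b = torch.tensor([[0.01, 0.0],   # correct, low confidence
                         [0.0, 0.01]])  # correct, low confidence

acc_a = (logits_a.argmax(dim=1) == labels).float().mean()
acc_b = (logits_b.argmax(dim=1) == labels).float().mean()
loss_a = F.cross_entropy(logits_a, labels)
loss_b = F.cross_entropy(logits_b, labels)

print(f'A: acc={acc_a:.2f} loss={loss_a:.3f}')  # A: acc=0.50 loss=0.399
print(f'B: acc={acc_b:.2f} loss={loss_b:.3f}')  # B: acc=1.00 loss=0.688
```

Both the accuracy and the loss increased from A to B, which is exactly the situation described above.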
In this part we will explore convolutional networks and the effects of their architecture on accuracy. We'll implement a common block-based deep CNN pattern and perform various experiments on it while varying the architecture. Then we'll implement our own custom architecture to see whether we can achieve high classification accuracy on a large subset of CIFAR-10.
Training will be performed on GPU.
# from google.colab import drive
# drive.mount('/content/gdrive')
# %cd '/content/gdrive/My Drive/Studies 2/Year 2/Semester A/Deep Learning/HW 2'
import os
import re
import sys
import glob
import numpy as np
import matplotlib.pyplot as plt
import unittest
import torch
import torchvision
import torchvision.transforms as tvtf
%matplotlib inline
%load_ext autoreload
%autoreload 2
seed = 42
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
plt.rcParams.update({'font.size': 12})
test = unittest.TestCase()
Convolutional layers are the essential building blocks of state-of-the-art deep learning image classification models and also play an important role in many other tasks. As we saw in the tutorial, when applied to images, convolutional layers operate on and produce volumes (3D tensors) of activations.
A convenient way to interpret convolutional layers for images is as a collection of 3D learnable filters, each of which operates on a small spatial region of the input volume. Each filter is convolved with the input volume ("slides over it"), and a dot product is computed at each location followed by a non-linearity which produces one activation. All these activations produce a 2D plane known as a feature map. Multiple feature maps (one for each filter) comprise the output volume.

A crucial property of convolutional layers is their translation equivariance, i.e. shifting the input results in an equivalently shifted output. This produces the ability to detect features regardless of their spatial location in the input.
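This property is easy to verify numerically. The sketch below (an illustration, not part of the assignment code) uses circular padding so that the equivariance is exact rather than only approximate near the borders: convolving a shifted input gives exactly the shifted output.

```python
import torch
import torch.nn as nn

# With circular padding, the convolution is exactly translation-equivariant:
# shifting the input by (dy, dx) shifts the output by the same amount.
torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, padding_mode='circular')
x = torch.randn(1, 1, 8, 8)

shifted_then_conv = conv(torch.roll(x, shifts=(2, 3), dims=(2, 3)))
conv_then_shifted = torch.roll(conv(x), shifts=(2, 3), dims=(2, 3))
print(torch.allclose(shifted_then_conv, conv_then_shifted, atol=1e-6))  # True
```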
Convolutional network architectures usually follow a pattern of basic repeating blocks: one or more convolution layers, each followed by a non-linearity (generally ReLU), and then a pooling layer to reduce the spatial dimensions. Usually, the number of convolutional filters increases the deeper they are in the network. These layers are meant to extract features from the input. Then, one or more fully-connected layers are used to combine the extracted features into the required number of output class scores.
PyTorch provides all the basic building blocks needed for creating a convolutional architecture within the torch.nn package.
Let's use them to create a basic convolutional network with the following architecture pattern:
[(CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
Here $N$ is the total number of convolutional layers, $P$ specifies how many convolutions to perform before each pooling layer and $M$ specifies the number of hidden fully-connected layers before the final output layer.
TODO: Complete the implementation of the ConvClassifier class in the hw2/cnn.py module.
Use PyTorch's nn.Conv2d and nn.MaxPool2d for the convolution and pooling layers.
It's recommended to implement the missing functionality in the order of the class' methods.
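As a rough guide, the repeating [(CONV -> ACT)*P -> POOL]*(N/P) pattern can be sketched with plain torch.nn blocks. This is a hedged, minimal illustration (the function name and fixed kernel/pool sizes are our own choices, not the hw2 API):

```python
import torch
import torch.nn as nn

def make_feature_extractor(in_channels, channels, pool_every):
    # Builds [(CONV -> ReLU)*P -> POOL]*(N/P) as a plain nn.Sequential.
    layers = []
    prev = in_channels
    for i, out_ch in enumerate(channels):
        layers.append(nn.Conv2d(prev, out_ch, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        prev = out_ch
        if (i + 1) % pool_every == 0:  # pool after every P conv layers
            layers.append(nn.MaxPool2d(kernel_size=2))
    return nn.Sequential(*layers)

fe = make_feature_extractor(3, [32, 32, 64, 64], pool_every=2)
out = fe(torch.zeros(1, 3, 32, 32))
print(out.shape)  # each MaxPool halves the spatial dims: 32 -> 16 -> 8
```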
import hw2.cnn as cnn
test_params = [
    dict(
        in_size=(3,100,100), out_classes=10,
        channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
        conv_params=dict(kernel_size=3, stride=1, padding=1),
        activation_type='relu', activation_params=dict(),
        pooling_type='max', pooling_params=dict(kernel_size=2),
    ),
    dict(
        in_size=(3,100,100), out_classes=10,
        channels=[32]*4, pool_every=2, hidden_dims=[100]*2,
        conv_params=dict(kernel_size=5, stride=2, padding=3),
        activation_type='lrelu', activation_params=dict(negative_slope=0.05),
        pooling_type='avg', pooling_params=dict(kernel_size=3),
    ),
]
for i, params in enumerate(test_params):
    torch.manual_seed(seed)
    net = cnn.ConvClassifier(**params)
    print(f"\n=== test {i} ===")
    print(net)
    test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
    test_out = net(test_image)
    print(f'{test_out}')
    expected_out = torch.load(f'tests/assets/expected_conv_out_{i:02d}.pt')
    diff = torch.norm(test_out - expected_out).item()
    print(f'{diff}')
    test.assertLess(diff, 1e-3)
=== test 0 ===
ConvClassifier(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): ReLU()
(2): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(3): ReLU()
(4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
(5): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(6): ReLU()
(7): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(8): ReLU()
(9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(classifier): Sequential(
(0): Linear(in_features=20000, out_features=100, bias=True)
(1): ReLU()
(2): Linear(in_features=100, out_features=100, bias=True)
(3): ReLU()
(4): Linear(in_features=100, out_features=10, bias=True)
)
)
tensor([[-0.0868, -0.3790, -0.4341, -0.1236, -0.2160, 0.1683, 0.4739, 0.0750,
0.1151, -0.1606]], grad_fn=<AddmmBackward>)
6.608794933526951e-07
=== test 1 ===
ConvClassifier(
(feature_extractor): Sequential(
(0): Conv2d(3, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(1): LeakyReLU(negative_slope=0.05)
(2): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(3): LeakyReLU(negative_slope=0.05)
(4): AvgPool2d(kernel_size=3, stride=3, padding=0)
(5): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(6): LeakyReLU(negative_slope=0.05)
(7): Conv2d(32, 32, kernel_size=(5, 5), stride=(2, 2), padding=(3, 3))
(8): LeakyReLU(negative_slope=0.05)
(9): AvgPool2d(kernel_size=3, stride=3, padding=0)
)
(classifier): Sequential(
(0): Linear(in_features=32, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.05)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.05)
(4): Linear(in_features=100, out_features=10, bias=True)
)
)
tensor([[ 0.1617, 0.0090, 0.1085, -0.0883, 0.0238, -0.1273, -0.1251, -0.0495,
-0.0356, 0.1318]], grad_fn=<AddmmBackward>)
0.0
Let's load CIFAR-10 again to use as our dataset.
data_dir = os.path.expanduser('~/.pytorch-datasets')
ds_train = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=True, transform=tvtf.ToTensor())
ds_test = torchvision.datasets.CIFAR10(root=data_dir, download=True, train=False, transform=tvtf.ToTensor())
print(f'Train: {len(ds_train)} samples')
print(f'Test: {len(ds_test)} samples')
x0,_ = ds_train[0]
in_size = x0.shape
num_classes = 10
print('input image size =', in_size)
Files already downloaded and verified Files already downloaded and verified Train: 50000 samples Test: 10000 samples input image size = torch.Size([3, 32, 32])
Now as usual, as a sanity test let's make sure we can overfit a tiny dataset with our model. But first we need to adapt our Trainer for PyTorch models.
TODO: Complete the implementation of the TorchTrainer class in the hw2/training.py module.
import hw2.training as training
torch.manual_seed(seed)
# Define a tiny part of the CIFAR-10 dataset to overfit it
batch_size = 2
max_batches = 25
dl_train = torch.utils.data.DataLoader(ds_train, batch_size, shuffle=False)
# Create model, loss and optimizer instances
model = cnn.ConvClassifier(
in_size, num_classes, channels=[32], pool_every=1, hidden_dims=[100],
conv_params=dict(kernel_size=3, stride=1, padding=1),
pooling_params=dict(kernel_size=2),
)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,)
# Use TorchTrainer to run only the training loop a few times.
trainer = training.TorchTrainer(model, loss_fn, optimizer, device)
best_acc = 0
for i in range(30):
    res = trainer.train_epoch(dl_train, max_batches=max_batches, verbose=(i%2==0))
    best_acc = res.accuracy if res.accuracy > best_acc else best_acc
# Test overfitting
test.assertGreaterEqual(best_acc, 95)
train_batch (Avg. Loss 2.371, Accuracy 6.0): 100%|██████████| 25/25 [00:00<00:00, 181.00it/s] train_batch (Avg. Loss 2.238, Accuracy 16.0): 100%|██████████| 25/25 [00:00<00:00, 203.12it/s] train_batch (Avg. Loss 2.131, Accuracy 22.0): 100%|██████████| 25/25 [00:00<00:00, 190.72it/s] train_batch (Avg. Loss 1.830, Accuracy 36.0): 100%|██████████| 25/25 [00:00<00:00, 200.89it/s] train_batch (Avg. Loss 1.166, Accuracy 58.0): 100%|██████████| 25/25 [00:00<00:00, 218.69it/s] train_batch (Avg. Loss 1.213, Accuracy 56.0): 100%|██████████| 25/25 [00:00<00:00, 215.93it/s] train_batch (Avg. Loss 0.607, Accuracy 82.0): 100%|██████████| 25/25 [00:00<00:00, 216.10it/s] train_batch (Avg. Loss 0.701, Accuracy 74.0): 100%|██████████| 25/25 [00:00<00:00, 216.47it/s] train_batch (Avg. Loss 0.059, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 202.37it/s] train_batch (Avg. Loss 0.007, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 219.17it/s] train_batch (Avg. Loss 0.001, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 217.35it/s] train_batch (Avg. Loss 0.001, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 218.27it/s] train_batch (Avg. Loss 0.000, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 218.70it/s] train_batch (Avg. Loss 0.000, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 206.13it/s] train_batch (Avg. Loss 0.000, Accuracy 100.0): 100%|██████████| 25/25 [00:00<00:00, 218.45it/s]
A very common addition to the basic convolutional architecture described above is the shortcut connection. First proposed by He et al. (2016), this simple addition has been shown to be a crucial ingredient for achieving effective learning with very deep networks. Virtually all state-of-the-art image classification models from recent years use this technique.
The idea is to add a shortcut, or skip connection, around every two or more convolutional layers:

This gives the network an easy way to learn an identity mapping: simply set the weight values to be very small. The consequence is that the convolutional layers learn a residual mapping, i.e. some delta that is added to the identity mapping, instead of learning a completely new mapping from scratch.
Let's start by implementing a general residual block, representing a structure similar to the above diagrams. Our residual block will be composed of a main path of convolutions and, when the input and output channel dimensions differ, a shortcut path with a 1x1 convolution to project the channel dimension.

TODO: Complete the implementation of the ResidualBlock's __init__() method in the hw2/cnn.py module.
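To make the structure concrete, here is a hypothetical minimal residual block (not the hw2 ResidualBlock, which also supports dropout, batchnorm, and configurable kernels): a main path of two convolutions, and a 1x1 projection on the shortcut only when the channel counts differ.

```python
import torch
import torch.nn as nn

class TinyResBlock(nn.Module):
    # Minimal sketch of a residual block: main path of two 3x3 convs, plus a
    # 1x1 conv on the shortcut only when channel counts differ (otherwise the
    # shortcut is the identity).
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        self.shortcut = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False))

    def forward(self, x):
        # The residual sum: main-path delta plus the (projected) identity.
        return torch.relu(self.main(x) + self.shortcut(x))

x = torch.zeros(1, 3, 32, 32)
y = TinyResBlock(3, 8)(x)
print(y.shape)  # torch.Size([1, 8, 32, 32])
```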
torch.manual_seed(seed)
resblock = cnn.ResidualBlock(
in_channels=3, channels=[6, 4]*2, kernel_sizes=[3, 5]*2,
batchnorm=True, dropout=0.2
)
print(resblock)
test_out = resblock(torch.zeros(1, 3, 32, 32))
print(f'{test_out.shape}')
expected_out = torch.load('tests/assets/expected_resblock_out.pt')
test.assertLess(torch.norm(test_out - expected_out).item(), 1e-3)
ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.2, inplace=False)
(2): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): ReLU()
(4): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(5): Dropout2d(p=0.2, inplace=False)
(6): BatchNorm2d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): ReLU()
(8): Conv2d(4, 6, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.2, inplace=False)
(10): BatchNorm2d(6, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): ReLU()
(12): Conv2d(6, 4, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 4, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
torch.Size([1, 4, 32, 32])
Now, based on the ResidualBlock, we'll implement our own variation of a residual network (ResNet),
with the following architecture:
[-> (CONV -> ACT)*P -> POOL]*(N/P) -> (FC -> ACT)*M -> FC
\------- SKIP ------/
Note that $N$, $P$ and $M$ are as before, however now $P$ also controls the number of convolutional layers to add a skip-connection to.
In the ResNet Block diagram shown above, the right block is called a bottleneck block. This type of block is mainly used deep in the network, where the feature space becomes increasingly high-dimensional (i.e. there are many channels).
Instead of applying a KxK conv layer on the original input channels, a bottleneck block first projects to a lower number of features (channels), applies the KxK conv on the result, and then projects back to the original feature space. Both projections are performed with 1x1 convolutions.
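To see why this helps, compare parameter counts for a direct 3x3 convolution on 256 channels against a 1x1 -> 3x3 -> 1x1 bottleneck through 64 channels (a back-of-the-envelope sketch; the specific channel counts are illustrative):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

direct = nn.Conv2d(256, 256, kernel_size=3, padding=1)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),            # project down
    nn.Conv2d(64, 64, kernel_size=3, padding=1),  # cheap KxK conv
    nn.Conv2d(64, 256, kernel_size=1),            # project back up
)
print(n_params(direct), n_params(bottleneck))  # 590080 70016
```

The bottleneck performs the same-shaped mapping with roughly 8x fewer parameters, which is why it's preferred deep in the network where channel counts are large.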
TODO: Complete the implementation of the ResidualBottleneckBlock in the hw2/cnn.py module.
torch.manual_seed(seed)
resblock_bn = cnn.ResidualBottleneckBlock(
in_out_channels=256, inner_channels=[64, 32, 64], inner_kernel_sizes=[3, 5, 3],
batchnorm=False, dropout=0.1, activation_type="lrelu"
)
print(resblock_bn)
# Test a forward pass
test_in = torch.zeros(1, 256, 32, 32)
test_out = resblock_bn(test_in)
print(f'{test_out.shape}')
assert test_out.shape == test_in.shape
expected_out = torch.load('tests/assets/expected_resblock_bn_out.pt')
test.assertLess(torch.norm(test_out - expected_out).item(), 1e-3)
ResidualBottleneckBlock(
(main_path): Sequential(
(0): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): LeakyReLU(negative_slope=0.01)
(3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): Dropout2d(p=0.1, inplace=False)
(5): LeakyReLU(negative_slope=0.01)
(6): Conv2d(64, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(7): Dropout2d(p=0.1, inplace=False)
(8): LeakyReLU(negative_slope=0.01)
(9): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(10): Dropout2d(p=0.1, inplace=False)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1))
)
(shortcut_path): Sequential()
)
torch.Size([1, 256, 32, 32])
TODO: Complete the implementation of the ResNetClassifier class in the hw2/cnn.py module.
You must use your ResidualBlocks to group together every $P$ convolutional layers.
torch.manual_seed(seed)
net = cnn.ResNetClassifier(
in_size=(3,100,100), out_classes=10, channels=[32, 64]*3,
pool_every=4, hidden_dims=[100]*2,
activation_type='lrelu', activation_params=dict(negative_slope=0.01),
pooling_type='avg', pooling_params=dict(kernel_size=2),
batchnorm=True, dropout=0.1,
)
print(net)
torch.manual_seed(seed)
test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
test_out = net(test_image)
print('out =', test_out)
expected_out = torch.load('tests/assets/expected_resnet_out_nofp.pt')
test.assertLess(
torch.norm(test_out - expected_out).item(), 1e-3
)
ResNetClassifier(
(feature_extractor): Sequential(
(0): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(5): Dropout2d(p=0.1, inplace=False)
(6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(7): LeakyReLU(negative_slope=0.01)
(8): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): Dropout2d(p=0.1, inplace=False)
(10): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(11): LeakyReLU(negative_slope=0.01)
(12): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential(
(0): Conv2d(3, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
)
)
(1): AvgPool2d(kernel_size=2, stride=2, padding=0)
(2): ResidualBlock(
(main_path): Sequential(
(0): Conv2d(64, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): Dropout2d(p=0.1, inplace=False)
(2): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
(shortcut_path): Sequential()
)
)
(classifier): Sequential(
(0): Linear(in_features=160000, out_features=100, bias=True)
(1): LeakyReLU(negative_slope=0.01)
(2): Linear(in_features=100, out_features=100, bias=True)
(3): LeakyReLU(negative_slope=0.01)
(4): Linear(in_features=100, out_features=10, bias=True)
)
)
out = tensor([[ 0.2462, -2.0466, 10.2188, -1.6095, -2.4464, 9.1817, 2.0589, 1.6466,
7.3873, 2.7356]], grad_fn=<AddmmBackward>)
You will now perform a series of experiments that train various model configurations on a much larger part of the CIFAR-10 dataset.
To perform the experiments, you'll need to use a machine with a GPU since training time might be too long otherwise.
You can train your models on a CPU, but we recommend using a GPU server such as Google Colab. On such a server you will need to upload all of the files and this notebook manually (or mount them from your Google Drive).
Please note that in Colab, to run a shell command you must prefix it with '!', for instance:
! python -m hw2.experiments run-exp -n exp1_1 ....
You can also add cells to this notebook in order to run the experiments.
Here's an example of running a forward pass on the GPU (assuming you're running this notebook on a GPU-enabled machine).
net = net.to(device)
test_image = test_image.to(device)
test_out = net(test_image)
Notice how we called .to(device) on both the model and the input tensor.
Here the device is a torch.device object that we created above. If an nvidia GPU is available on the machine you're running this on, the device will be 'cuda'. When you run .to(device) on a model, it recursively goes over all the model parameter tensors and copies their memory to the GPU. Similarly, calling .to(device) on the input image also copies it.
In order to train on a GPU, you need to make sure to move all your tensors to it. You'll get errors if you try to mix CPU and GPU tensors in a computation.
print(f'This notebook is running with device={device}')
print(f'The model parameter tensors are also on device={next(net.parameters()).device}')
print(f'The test image is also on device={test_image.device}')
print(f'The output is therefore also on device={test_out.device}')
This notebook is running with device=cpu The model parameter tensors are also on device=cpu The test image is also on device=cpu The output is therefore also on device=cpu
Each experiment run writes its results to a file in the results folder on your local machine. This notebook will only display the results, not run the actual experiment code (except for a demo run).
Each run has a run_name parameter that will also be the base name of the results file which this notebook will expect to load.
The experiments are implemented in the hw2/experiments.py module. This module has a CLI parser so that you can invoke it as a script and pass in all the configuration parameters for a single experiment run.
Make sure to use python -m hw2.experiments run-exp to run an experiment, and not python hw2/experiments.py run-exp, regardless of how/where you run it.

In this part we will test some different architecture configurations based on our ConvClassifier and ResNetClassifier.
Specifically, we want to try different depths and number of features to see the effects these parameters have on the model's performance.
To do this, we'll define two extra hyperparameters for our model, K (filters_per_layer) and L (layers_per_block).
- K (filters_per_layer) is a list containing the number of filters we want to have in our conv layers.
- L (layers_per_block) is the number of consecutive layers with the same number of filters to use.

For example, if K=[32, 64] and L=2 it means we want two conv layers with 32 filters followed by two conv layers with 64 filters. If we also use pool_every=3, the feature-extraction part of our model will be:
Conv(X,32)->ReLU->Conv(32,32)->ReLU->Conv(32,64)->ReLU->MaxPool->Conv(64,64)->ReLU
We'll try various values of the K and L parameters in combination and see how each architecture trains. All other hyperparameters are up to you, including the choice of the optimization algorithm, the learning rate, regularization and architecture hyperparams such as pool_every and hidden_dims. Note that you should select the pool_every parameter wisely per experiment so that you don't end up with zero-width feature maps.
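A quick back-of-the-envelope check can help with choosing pool_every. The helpers below are hypothetical (not part of hw2): one expands (K, L) into the per-layer channel list, the other estimates the output spatial size assuming each pooling layer (kernel 2) halves the dimensions.

```python
# Hypothetical helpers for sanity-checking K, L and pool_every choices.
def expand_channels(K, L):
    # K=[32, 64], L=2  ->  [32, 32, 64, 64]
    return [k for k in K for _ in range(L)]

def out_spatial(in_size, n_layers, pool_every):
    # One MaxPool(kernel_size=2) per pool_every convs, each halving the dims.
    return in_size // (2 ** (n_layers // pool_every))

print(expand_channels([32, 64], 2))  # [32, 32, 64, 64]
print(out_spatial(32, 8, 2))         # 2: still a valid feature map
print(out_spatial(32, 12, 2))        # 0: zero-width! pool_every is too small
```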
You can try some short manual runs to determine some good values for the hyperparameters or implement cross-validation to do it. However, the dataset size you test on should be large. Use at least ~20000 training images and ~6000 validation images.
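One way to carve out such a train/validation split is with SubsetRandomSampler. The sketch below uses a small random stand-in dataset and scaled-down sizes purely for illustration; with the real CIFAR-10 dataset you would use e.g. 20000 training and 6000 validation indices instead.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Scaled-down stand-in for CIFAR-10 (hypothetical sizes and shapes).
ds = TensorDataset(torch.randn(2600, 3, 8, 8), torch.randint(0, 10, (2600,)))

# First 2000 indices for training, the next 600 for validation.
dl_train = DataLoader(ds, batch_size=50, sampler=SubsetRandomSampler(range(0, 2000)))
dl_valid = DataLoader(ds, batch_size=50, sampler=SubsetRandomSampler(range(2000, 2600)))
print(len(dl_train), len(dl_valid))  # 40 12
```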
The important thing is that you state what you used, how you decided on it, and explain your results based on that.
First we need to write some code to run the experiment.
TODO:
- Complete the run_experiment() function in the hw2/experiments.py module, using your Trainer class.

The following block tests that your implementation works. It's also meant to show you that each experiment run creates a result file containing the parameters needed to reproduce it and the FitResult object for plotting.
import hw2.experiments as experiments
from hw2.experiments import load_experiment
from cs3600.plot import plot_fit
# Test experiment1 implementation on a few data samples and with a small model
experiments.run_experiment(
'test_run', seed=seed, bs_train=50, batches=10, epochs=10, early_stopping=5,
filters_per_layer=[32,64], layers_per_block=1, pool_every=1, hidden_dims=[100],
model_type='resnet',
)
# There should now be a file 'test_run.json' in your `results/` folder.
# We can use it to load the results of the experiment.
cfg, fit_res = load_experiment('results/test_run_L1_K32-64.json')
_, _ = plot_fit(fit_res)
# And `cfg` contains the exact parameters to reproduce it
print('experiment config: ', cfg)
Files already downloaded and verified
Files already downloaded and verified
--- EPOCH 1/10 ---
train_batch (Avg. Loss 2.299, Accuracy 11.8): 100%|██████████| 10/10 [00:00<00:00, 14.66it/s]
test_batch (Avg. Loss 2.290, Accuracy 7.5): 100%|██████████| 10/10 [00:00<00:00, 126.23it/s]
train_batch (Avg. Loss 2.254, Accuracy 14.8): 100%|██████████| 10/10 [00:00<00:00, 15.31it/s]
test_batch (Avg. Loss 2.242, Accuracy 12.5): 100%|██████████| 10/10 [00:00<00:00, 113.76it/s]
train_batch (Avg. Loss 2.175, Accuracy 19.8): 100%|██████████| 10/10 [00:00<00:00, 15.53it/s]
test_batch (Avg. Loss 2.136, Accuracy 29.2): 100%|██████████| 10/10 [00:00<00:00, 130.65it/s]
train_batch (Avg. Loss 2.068, Accuracy 26.8): 100%|██████████| 10/10 [00:00<00:00, 15.50it/s]
test_batch (Avg. Loss 2.073, Accuracy 28.3): 100%|██████████| 10/10 [00:00<00:00, 129.16it/s]
*** Output file ./results/test_run_L1_K32-64.json written
experiment config: {'run_name': 'test_run', 'out_dir': './results', 'seed': 42, 'device': None, 'bs_train': 50, 'bs_test': 12, 'batches': 10, 'epochs': 10, 'early_stopping': 5, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'filters_per_layer': [32, 64], 'pool_every': 1, 'hidden_dims': [1024, 100], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}, 'layers_per_block': 1}
We'll use the following function to load multiple experiment results and plot them together.
import glob
import os
import re
import sys

def plot_exp_results(filename_pattern, results_dir='results'):
    fig = None
    result_files = glob.glob(os.path.join(results_dir, filename_pattern))
    result_files.sort()
    if len(result_files) == 0:
        print(f'No results found for pattern {filename_pattern}.', file=sys.stderr)
        return
    for filepath in result_files:
        # m[2] holds the run-specific part of the filename (e.g. 'L2_K32'),
        # which can be passed as a legend to plot_fit if desired.
        m = re.match(r'exp\d_(\d_)?(.*)\.json', os.path.basename(filepath))
        cfg, fit_res = load_experiment(filepath)
        fig, axes = plot_fit(fit_res, fig, log_loss=True)  # , legend=m[2])
    # Drop the per-run parameters so only the shared configuration is printed
    del cfg['filters_per_layer']
    del cfg['layers_per_block']
    print('common config: ', cfg)
Experiment 1.1: Network depth (L)

First, we'll test the effect of the network depth on training.

Configurations:

- K=32 fixed, with L=2,4,8,16 varying per run
- K=64 fixed, with L=2,4,8,16 varying per run

So 8 different runs in total.
Naming runs:
Each run should be named exp1_1_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_1_L2_K32.
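As a throwaway sanity check of the naming scheme (not part of the assignment code), the eight expected run names can be generated like so:

```python
# One name per (K, L) combination, in the order the configurations are listed
run_names = [f'exp1_1_L{L}_K{K}'
             for K in (32, 64)
             for L in (2, 4, 8, 16)]
print(run_names)  # 8 names; the first is 'exp1_1_L2_K32'
```

Comparing this list against the files that appear in `results/` is a quick way to catch a mis-named run before plotting.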
TODO: Run the experiment on the above configuration with the ConvClassifier model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_1_L*_K32*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 219523951, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 50, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [2048, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
plot_exp_results('exp1_1_L*_K64*.json')
common config: {'run_name': 'exp1_1', 'out_dir': './results', 'seed': 190133811, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 50, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [4096, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
Experiment 1.2: Number of filters per layer (K)

Now we'll test the effect of the number of convolutional filters in each layer.

Configurations:

- L=2 fixed, with K=[32],[64],[128],[256] varying per run
- L=4 fixed, with K=[32],[64],[128],[256] varying per run
- L=8 fixed, with K=[32],[64],[128],[256] varying per run

So 12 different runs in total. To clarify, in each run K takes the value of a list with a single element.
Naming runs:
Each run should be named exp1_2_L{}_K{} where the braces are placeholders for the values. For example, the first run should be named exp1_2_L2_K32.
TODO: Run the experiment on the above configuration with the ConvClassifier model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_2_L2*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 1165108623, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [4096, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
plot_exp_results('exp1_2_L4*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 1575099849, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [16384, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
plot_exp_results('exp1_2_L8*.json')
common config: {'run_name': 'exp1_2', 'out_dir': './results', 'seed': 1458640907, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [4096, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
Experiment 1.3: Filters per layer (K) and network depth (L)

Now we'll test the combined effect of the number of convolutional filters per layer and the network depth.

Configurations:

- K=[64, 128, 256] fixed, with L=1,2,3,4 varying per run

So 4 different runs in total. To clarify, in each run K takes the value of a list with three elements.
Naming runs:
Each run should be named exp1_3_L{}_K{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp1_3_L1_K64-128-256.
TODO: Run the experiment on the above configuration with the ConvClassifier model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_3*.json')
common config: {'run_name': 'exp1_3', 'out_dir': './results', 'seed': 1782796579, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [4096, 100], 'model_type': 'cnn', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': [1, 1]}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
Now we'll test the effect of skip connections on the training and performance.
Configurations:

- K=[32] fixed, with L=8,16,32 varying per run
- K=[64, 128, 256] fixed, with L=2,4,8 varying per run

So 6 different runs in total.
Naming runs:
Each run should be named exp1_4_L{}_K{}-{}-{} where the braces are placeholders for the values.
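Since this experiment mixes a single-element K with a three-element K, the expected names can likewise be listed up front (again just a throwaway check of the naming scheme, not assignment code):

```python
# Single-element K lists drop the dashes; multi-element lists join values with '-'
exp1_4_names = ([f'exp1_4_L{L}_K32' for L in (8, 16, 32)]
                + [f'exp1_4_L{L}_K64-128-256' for L in (2, 4, 8)])
print(exp1_4_names)  # 6 names in total
```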
TODO: Run the experiment on the above configuration with the ResNetClassifier model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp1_4_L*_K32.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 961657123, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 4, 'hidden_dims': [2048, 100], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
plot_exp_results('exp1_4_L*_K64*.json')
common config: {'run_name': 'exp1_4', 'out_dir': './results', 'seed': 950914547, 'device': None, 'bs_train': 128, 'bs_test': 32, 'batches': 500, 'epochs': 100, 'early_stopping': 3, 'checkpoints': None, 'lr': 0.001, 'reg': 0.001, 'pool_every': 8, 'hidden_dims': [4096, 1000], 'model_type': 'resnet', 'conv_params': {'kernel_size': 3, 'stride': [1, 1], 'padding': 1}, 'activation_type': 'relu', 'activation_params': {}, 'pooling_type': 'avg', 'pooling_params': {'kernel_size': 2}, 'batchnorm': True, 'dropout': 0.1, 'kw': {}}
In this part you will create your own custom network architecture based on the ConvClassifier you've implemented. Try to overcome some of the limitations you observed in your experiment 1 results, using what you learned in the course.

You are free to add whatever you like to the model. Just make sure to keep the model's init API identical (or only add parameters to it).
TODO: Implement your custom architecture in the YourCodeNet class within the hw2/cnn.py module.
# net = cnn.YourCodeNet((3,100,100), 10, channels=[32]*4, pool_every=2, hidden_dims=[100]*2)
# print(net)
# test_image = torch.randint(low=0, high=256, size=(3, 100, 100), dtype=torch.float).unsqueeze(0)
# test_out = net(test_image)
# print('out =', test_out)
Run your custom model on at least the following:

Configurations:

- K=[32, 64, 128] fixed, with L=3,6,9,12 varying per run

So 4 different runs in total. To clarify, in each run K takes the value of a list with three elements.
If you want, you can add some extra runs following the same pattern. Try to see how deep a model you can train.
Naming runs:
Each run should be named exp2_L{}_K{}-{}-{} where the braces are placeholders for the values. For example, the first run should be named exp2_L3_K32-64-128.
TODO: Run the experiment on the above configuration with the YourCodeNet model. Make sure the result file names are as expected. Use the following blocks to display the results.
plot_exp_results('exp2*.json')
No results found for pattern exp2*.json.
TODO Answer the following questions. Write your answers in the appropriate variables in the module hw2/answers.py.
from cs3600.answers import display_answer
import hw2.answers
Consider the bottleneck block from the right side of the ResNet diagram above. Compare it to a regular block that performs two 3x3 convolutions directly on the 256-channel input (i.e. as shown on the left side of the diagram, but with a different number of channels). Explain the differences between the regular block and the bottleneck block in terms of:
display_answer(hw2.answers.part3_q1)
1. The number of parameters in a conv layer is kernel_height × kernel_width × (input channels) × (output channels), plus one bias per output channel.
Regular block (two 3x3 convs operating directly on 256 channels): each layer has 3×3×256×256 + 256 = 590,080 parameters, so together 2 × 590,080 = 1,180,160.
Bottleneck block: conv layer 1 (1x1, 256→64): 1×1×256×64 + 64 = 16,448; conv layer 2 (3x3, 64→64): 3×3×64×64 + 64 = 36,928; conv layer 3 (1x1, 64→256): 1×1×64×256 + 256 = 16,640. Together: 16,448 + 36,928 + 16,640 = 70,016.
2. The number of floating point operations required to compute an output is roughly (number of parameters of the block, calculated above) × width × height of each output feature map. Since both blocks preserve the spatial size, the regular block needs about 1,180,160 × W × H operations, the bottleneck about 70,016 × W × H, i.e. roughly 17 times fewer.
3. 1) Ability to combine the input spatially (within feature maps): the regular block is stronger here, since all of its layers are 3x3 convs, so every stage combines a 3x3 spatial neighborhood; in the bottleneck block the first and last layers are 1x1 convs, each of whose outputs depends on only a single spatial location. 2) Ability to combine the input across feature maps: the bottleneck block combines across channels just as well, and more cheaply; its 1x1 convs mix all 256 input channels, projecting them down to 64 and then back up to 256, so cross-channel mixing happens at every layer.
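The parameter counts for the two block types can be computed directly in a few lines. This is a standalone check; conv2d_params below is our own helper, using the standard (k·k·C_in + 1)·C_out formula with one bias per filter:

```python
def conv2d_params(k, c_in, c_out):
    # Each of the c_out filters has a k x k kernel over c_in channels, plus one bias
    return (k * k * c_in + 1) * c_out

# Regular block: two 3x3 convs operating directly on 256 channels
regular = 2 * conv2d_params(3, 256, 256)
# Bottleneck block: 1x1 (256->64), then 3x3 (64->64), then 1x1 (64->256)
bottleneck = (conv2d_params(1, 256, 64)
              + conv2d_params(3, 64, 64)
              + conv2d_params(1, 64, 256))
print(regular, bottleneck)  # -> 1180160 70016
```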
Analyze your results from experiment 1.1. In particular: what is the effect of the network depth on accuracy? Were there values of L for which the network wasn't trainable? What causes this? Suggest two things which may be done to resolve it, at least partially.

display_answer(hw2.answers.part3_q2)
1. We can see that in the case of K=32 the best results appeared with L=2,4; for deeper nets with L=8,16 there was no training at all. In the case of K=64 the best results are still with L=2,4, but here L=8 also achieved measurable results (although only after about twice as many epochs as L=2,4). This can be explained by the effect of vanishing gradients as the net gets deeper, which cripples its ability to learn. We can also see that increasing the number of filters in each layer helps the training of deeper networks.
2. In the case of L=16 the network seems untrainable. We could help it by using residual blocks (skip connections) or by increasing the number of filters in each layer (as we saw for L=8 in the previous point).
Analyze your results from experiment 1.2. In particular, compare to the results of experiment 1.1.
display_answer(hw2.answers.part3_q3)
In experiment 1.2 we can see that L=4 gave better results than L=2,8. We can also observe that for each L there is a different K that fits it best. We can conclude from this section that using a deeper network with a larger number of filters doesn't necessarily give the best results.
Analyze your results from experiment 1.3.
display_answer(hw2.answers.part3_q4)
We can see that we got better results for L=1,2 than for L=3,4, for the same reason we saw before: as the network gets deeper, training gets harder (and sometimes impossible).
Analyze your results from experiment 1.4. Compare to experiment 1.1 and 1.3.
display_answer(hw2.answers.part3_q5)
With fixed K, we can clearly see that the accuracy gets lower as we increase the depth of the network. With a different number of filters per layer (and skip connections), we can see that training works better even for deep networks, although we still see slightly better results for L=2,4 than for L=8. We can credit the skip connections here: they help regularize the network and keep the gradients of the deeper layers from shrinking almost to zero.
Explain the modifications you made to the architecture in your YourCodeNet class.

display_answer(hw2.answers.part3_q6)
Your answer: